The Fourteenth Image Understanding (IU) Workshop sponsored by the Defense Advanced Research
Projects Agency, Information Processing Techniques Office was held in Arlington, Virginia, on June 23rd,
1983.  The  workshop  was  conducted  as  a  full  day  session  of  the  Computer  Vision  and  Pattern  Recognition 
Conference  presented  by  the  Computer  Society  of  the  IEEE. 

Commander  Ronald  S.  Ohlander,  USN,  the  Intelligent  Systems  Program  Manager  for  the  DARPA/IPTO, 
welcomed  the  large  audience  consisting  of  research  personnel  involved  in  the  Image  Understanding  Program, 
Government personnel from various departments and agencies, and attendees from the CVPR conference
interested  in  the  research  efforts  ongoing  in  this  DARPA  sponsored  program.  He  noted  that  the  existence 
of  so  large  and  varied  a  conference  as  the  CVPR,  which  has  covered  two  days  of  tutorials  and  three  days 
of  general  sessions  as  well  as  this  workshop,  indicates  the  high  level  of  interest  and  wide  variety  of 
mature  research  now  ongoing  in  the  Image  Processing  field.  This  is  the  second  time  that  DARPA  has 
coordinated  its  IU  workshop  with  a  professional  society  active  in  the  field,  remarked  CDR  Ohlander,  the 
first  being  a  joint  meeting  in  April  1981  with  the  Society  of  Photo  Optical  Instrumentation  Engineers 
(SPIE). CDR Ohlander indicated that the growing body of highly sophisticated researchers, particularly
in  the  Universities  but  also  in  the  general  industrial  community,  was  a  paramount  factor  in  the  growing 
usefulness  of  IU  science  in  both  military  and  non  military  fields  of  endeavor.  This  combined  meeting, 
he  concluded,  is  an  excellent  opportunity  for  users  and  theoreticians  to  interact  to  the  mutual  benefit 
of  both  groups. 

The  morning  and  first  part  of  the  afternoon  session  of  the  workshop  comprised  thirteen  technical 
reports.  These  reports  were  selected  by  the  principal  investigators  as  representing  an  interesting  facet 
of  their  research  programs.  Due  to  the  press  of  time,  each  organization  involved  in  the  program  was 
limited  to  only  one  presentation.  However,  in  order  to  provide  as  complete  a  record  as  possible  for  use  of 
government  sponsors,  all  reports  produced  by  the  various  researchers  in  the  DARPA  Program  are  included  in 
this proceedings. A few reports were presented at other sessions of the CVPR Conference and are
therefore published in the CVPR proceedings as well as in this volume.

The remainder of the workshop consisted of a panel discussion on the topic "Most important
problems to be addressed in IU over the next few years". This subject was included in order to elicit
comments from the wide experience available in the audience as well as the expertise of the panel
discussants.


This  proceedings  has  been  supplied  to  the  Defense  Technical  Information  Center  (DTIC)  and  copies 
may  be  secured  from  that  Agency  by  writing  to  the  following  address: 

Defense  Technical  Information  Center 
Cameron  Station,  Bldg.  #5 
Alexandria,  Virginia  22314 

A small charge is assessed by the DTIC for reproduction expenses. The accession number for this
proceedings is not yet available but will be assigned by the DTIC within the next thirty days. Accession
numbers for previous issues are listed on the following page.

The materials for the cover of this proceedings were supplied by Dr. Martin Herman of
Carnegie-Mellon University. Dr. Herman described the process shown on the cover as follows:

The  layout  shows  the  flow  of  events  in  the  3D  Mosaic  scene  understanding  system. 

The  stereo  aerial  photographs  show  part  of  Washington,  D.  C.  The  3D  wire-frame 
description  of  the  scene  was  produced  by  a  process  that  extracted  and  matched 
junctions  from  the  images.  A  geometric  modelling  process  then  converted  the  wire 
frames into a surface-based description of the scene. The reconstructed buildings
are shown in the two bottom pictures. In the picture on the lower right, gray scale
obtained from one of the top images is mapped onto the faces of the buildings. The
stereo reconstruction process represents one step in the 3D Mosaic system, which
obtains a more complete description of the scene by incrementally accumulating
information derived from multiple viewpoints. The researchers on this project include
Dr. Martin Herman, Dr. Takeo Kanade, Mr. Shigeru Kuroe, and Mr. Duane Williams.

A more complete description may be found in Dr. Herman's paper, "Monocular Reconstruction of a
Complex Urban Scene in the 3D MOSAIC System", reproduced in section III of this proceedings.

Mr.  Tom  Dickerson  of  Science  Applications,  Inc.  was  responsible  for  the  artwork  and  lay-out  for 
the  proceedings  cover.  Appreciation  is  also  due  Ms.  Neville  Worthington  of  Science  Applications,  Inc. 
for  her  assistance  with  arrangements,  and  particularly  for  typing  support  and  in  putting  together  this 
proceedings. Finally, our thanks to the Computer Society, IEEE, for their cooperation and assistance
during  the  planning  and  execution  for  the  conference  and  workshop.  Particularly  helpful  were  Mr.  Harry 
Hayman  and  Ms.  Jerry  Katz  of  IEEE  and  Dr.  Takeo  Kanade  of  Carnegie-Mellon  University,  the  conference 
chairman. 

Lee  S.  Baumann 

Science  Applications,  Inc. 

Workshop  Organizer 


AUTHOR INDEX

NAME                    PAGE

Adiv, G.                285
Anandan, P.             233
Baker, H. H.            327
Ballard, D. H.          43, 64
Barnard, S. T.          282
Binford, T. O.          28, 203,
Blicher, A. P.          293
Bogdanowicz, J. F.      156
Bolles, R. C.           224
Brown, C. M.            43
Bullock, B. L.          156
Close, D. H.            156
Cornelius, N.           257
Davis, L. S.            32, 61
Edwards, G. R.          156
Etchells, R. D.         137
Feldman, J. A.          43
Fischler, M. A.         24, 224
Foster, C.              336
Glazer, F.              233
Goad, C.                94
Griffith, J. S.         193
Hanson, A. R.           37, 193
Haralick, R. M.         84, 304


Herman, M.              318
Hrechanyk, L. M.        64
Jones, G. R.            163
Kanade, T.              1, 210, 257
Kass, M.                54
Kender, J. R.           8, 49, 249
Keirsey, D. M.          156
Ketonen, J.             182
Laffey, T. J.           304
Laws, K. I.             148
Lawton, D. T.           77, 266, 336
Lee, H. Y.              298
Lee, J. S.              84
Lowe, D. G.             203
Levitan, S.             336
Malik, J.               327
McKeown, D. M. Jr.      105
Medioni, G. G.          128
Meller, J. F.           327
Nevatia, R.             47, 128
Nudd, G. R.             137
Parks, H. A.            156
Partridge, D. R.        156


Pentland, A. P.         184, 282
Poggio, T.              11
Preyss, E. P.           156
Reynolds, G.            233
Rieger, J. H.           77
Riseman, E. M.          37, 193
Rosenfeld, A.           32, 61
Scott, R.               219
Schachter, B. J.        163
Shafer, S. A.           210
Smith, G. B.            243
Tisdale, G. E.          163
Tseng, D. Y.            156
Ullman, S.              11
Vilnrotter, F. M.       156
Watson, L. T.           314
Waxman, A. N.           175
Weems, C.               336
Weymouth, T. E.         193
Wohn, K.                61
Xie, H.                 61


Image Understanding Research at CMU

Takeo Kanade


Computer Science Department
Carnegie-Mellon University
Pittsburgh, PA 15213


The goals of Image Understanding Research at CMU have been to
develop basic theory for understanding 3-dimensional shapes and to
demonstrate an integrated system for photo interpretation (database and
interactive/automatic image interpretation techniques). For these goals
we have been working in three subareas: 1) Incremental 3D Mosaic
System; 2) Theory for Shape Understanding; and 3) MAPS. This report
reviews our progress since the September 1982 workshop proceedings.

1.  Incremental  3D  Mosaic  System 

The Incremental 3D Mosaic system acquires a 3D surface-based
description (or model) of a complex urban scene by incrementally
accumulating information derived from multiple viewpoints. Since our
report in the September 1982 proceedings [Herman, Kanade, and Kuroe
82], we have made significant progress in two components of the system:
the component that merges information from a new view into the
current model, and the component that performs monocular analysis of
an image.

As shown in Figure 1, each view of a given scene (which may be
either a single image or a stereo pair) undergoes analysis which results in
a 3D wire-frame description that represents portions of edges and
vertices of spatial structures such as buildings. In order to update the
current scene model (which has been obtained from previous views), the
wire-frame description from the current view must be matched with and
merged into the current model. The matching step provides the
coordinate transformation from the wire frames to the model and
provides corresponding edges and vertices in the two. The combined
result must then be converted into a new model.

The merging step works as follows. Two objects, one in the wire-frame
description and the other in the model, are merged by first
merging their corresponding pairs of edges and vertices into single
elements by weighted averages of their positions. Next, hypothesized
elements (faces, edges, or vertices) in the model that are inconsistent
with modified elements are deleted. To determine whether
inconsistencies exist, dependencies have been recorded for each
hypothesis at the time of its creation. A hypothesis is dependent on all
elements whose existence directly resulted in the creation of the
hypothesis. For example, if an open polygon is completed by
hypothesizing a line connecting the two end points of the chain of
segments, the hypothesized line is dependent on the two end lines of the
chain. If one of these lines is modified or deleted, the hypothesis must
also be deleted, for the conditions under which it was created are no
longer valid. After all mergings and deletions, the remaining edges and
vertices in the wire-frame object are added to the model object. After
this is done for all objects, those objects which are incomplete are
completed using task-specific knowledge, as described in [Herman,
Kanade, and Kuroe 82] [Herman, Kanade, and Kuroe 83].
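
The two core operations of the merging step, position averaging and dependency-based deletion, can be sketched as follows. This is a minimal illustration, not the 3D Mosaic implementation: the `Element` and `Hypothesis` classes, the per-element `weight`, and the element names are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """A vertex from the model or the wire frame (hypothetical structure)."""
    position: tuple          # (x, y, z)
    weight: float = 1.0      # assumed confidence used in the weighted average


@dataclass
class Hypothesis:
    """A hypothesized element and the names of the elements it depends on."""
    name: str
    depends_on: set = field(default_factory=set)


def merge_elements(a, b):
    """Merge two corresponding elements by a weighted average of positions."""
    total = a.weight + b.weight
    position = tuple(
        (a.weight * p + b.weight * q) / total
        for p, q in zip(a.position, b.position)
    )
    return Element(position, total)


def prune_hypotheses(hypotheses, modified):
    """Keep only hypotheses that depend on no modified or deleted element."""
    return [h for h in hypotheses if not (h.depends_on & modified)]
```

For instance, a hypothesized closing line that depends on the two end lines of a chain is dropped as soon as either end line appears in `modified`, while hypotheses with untouched dependencies survive.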

Herman has also been developing a monocular analysis component
for the 3D Mosaic system [Herman 83] (in this volume). This component
reconstructs the three-dimensional shape of a complex urban scene from
a single image. His approach exploits task-specific knowledge involving
block-shaped objects in an urban scene. First, linear connected
structures in the image are generated; these are meant to represent edges
and vertices of buildings. Next, the 2D structures are converted into 3D
wire frames. Finally, a surface-based description of the scene is
generated from the wire frames.

In our database, we have two different views of part of Washington,
D.C.: a stereo pair for one view and a single image for the other.
Eventually we will merge the 3D wire frames obtained from the single
image with the scene model obtained from the stereo pair.

2. Theory for Shape Understanding

At CMU, we have been working on the geometrical aspects of image
constraints for extracting shape from images. We have continued our
effort in this important area to develop fundamental theories and their
applications for recovering three-dimensional shapes from images. Our
new results include:


•  Theory of circular straight homogeneous generalized cylinders
[Shafer and Kanade 83]

•  Stereo by dynamic programming in a three-dimensional search space
[Ohta and Kanade 83]

•  Optical flow methods for measuring object motion in an x-ray image
sequence [Cornelius and Kanade 83]

•  A method for obtaining topological correspondence of line drawings
of multiple views [Thorpe and Shafer 83]


2.1  Theory of Generalized Cylinders for Vision

Motivated by work in shadow analysis [Shafer and Kanade 82],
in which the shadow volume is a generalized cylinder, Shafer and
Kanade [Shafer and Kanade 83] (in this volume) have investigated the
formal properties of generalized cylinders. In recent years, Binford's
generalized cylinders have become an important tool for shape
representation in image understanding systems [Brooks 81]. However,
research has been hampered by a lack of analytical results for these
shapes. Shafer and Kanade start with a definition for Straight
Homogeneous Generalized Cylinders, those generalized cylinders with a
straight axis and with cross-sections which have constant shape but vary
in size. This class of shapes, while still quite large, has properties which
make considerable analysis possible.

The results begin with deriving formulae for points and surface
normals for these shapes. Theorems are presented concerning the
conditions under which multiple descriptions can exist for a single solid
shape. Then projections and contour generators are analyzed. The
strongest results are obtained for solids of revolution (which are named
Right Circular SHGCs), for which a closed-form method for analyzing
image contours is presented. Shafer and Kanade have shown that a
picture of the contours of a solid of revolution is ambiguous, with one
degree of freedom related to the angle between the line of sight and the
solid's axis. The ambiguity can be resolved by other constraints such as
those from shadow contours.
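
To make the definition concrete, the following sketch generates surface points of a simplified straight homogeneous generalized cylinder. It assumes the axis is the z-axis and that the cross-section planes are perpendicular to it; the function names and this parametrization are illustrative assumptions, not the formulae of the cited paper.

```python
import math


def shgc_point(cross_section, radius, s, t):
    """Return a surface point of a simplified SHGC whose axis is the z-axis.

    cross_section(t) -> (x, y): the fixed cross-section shape (assumed name).
    radius(s)        -> float : the scale applied at axis position s.
    """
    x, y = cross_section(t)
    r = radius(s)
    # Constant shape, varying size: the contour is only scaled, never deformed.
    return (r * x, r * y, s)


def unit_circle(t):
    """A circular cross-section; with it the SHGC is a solid of revolution."""
    return (math.cos(t), math.sin(t))
```

With `unit_circle` as the cross-section and, say, `radius = lambda s: 1.0 + s`, the surface is a cone-like solid of revolution; substituting any other closed contour for `unit_circle` keeps the shape in the same class.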


2.2  Optical Flow Method for Object Motion in X-ray Images

In calculating optical flow from an image sequence, Horn and
Schunck [Horn and Schunck 81] assumed that the image brightness
corresponding to the same physical point does not change, together with
the assumption of smoothness of velocity over the image. However, this
assumption of zero brightness change severely limits the allowable
motions. Rotations, translations in depth, and deformations often result
in a change in the image brightness corresponding to a single physical
point. Also, the assumptions of smoothness and zero brightness change
do not hold at the boundary of the object.

Figure 1: The current structure of the Incremental 3D Mosaic
system: boxes are major modules and ellipses are data
structures

It was shown that the problem of assuming zero brightness change is
magnified when we try to apply the method to an x-ray image sequence.
(In x-ray images, the brightness of each point depends on the amount
and density of the mass between the x-ray source and the film.)
Cornelius and Kanade [Cornelius and Kanade 83] (in this volume) have
adapted the optical flow algorithm so that it can handle the brightness
change and cope with the difficulty caused by the smoothness
assumption across the boundary. This algorithm assumes: (a) the
brightness changes corresponding to a single physical point can be
described by the first-order expansion of the image intensity function
I(x, y, t); (b) the velocity field (u, v) changes smoothly in a
neighborhood, unless the neighborhood contains an occluding
boundary; (c) the rate of change in brightness (dI/dt) is smooth in a
neighborhood. An iterative procedure was devised to compute the
velocity field and the change of brightness (i.e., change of thickness in
the case of x-ray images) under these conditions.
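
Assumption (a) can be written out explicitly. The notation here is assumed for illustration (I(x, y, t) for image intensity and (u, v) for image velocity) and is not transcribed from the cited paper. By the chain rule, the rate of brightness change of a moving point is

```latex
\frac{dI}{dt} \;=\; \frac{\partial I}{\partial x}\,u \;+\; \frac{\partial I}{\partial y}\,v \;+\; \frac{\partial I}{\partial t}
```

The zero-brightness-change assumption corresponds to setting dI/dt = 0, which leaves one equation in the two unknowns (u, v) at each point. Allowing dI/dt to be nonzero adds a third unknown per point, which is why a smoothness condition on dI/dt, as in assumption (c), is needed alongside the smoothness of the velocity field.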

This algorithm can correctly recover the object motion from the x-ray
images of an expanding ellipsoid. We have actually applied the method
to real x-ray images of a dog's heart taken on film at 60 frames a second,
in which a radio-opaque dye was injected into the pulmonary artery just
before the image sequence was taken. For this case, the changes in
brightness will reflect the expansion or contraction movement of the
heart in the direction perpendicular to the image plane since the
dye-filled heart is the primary source of motion. We have generated a movie
of the velocity vectors for an entire heart cycle and shown that it
coincides well with the apparent motion seen in the actual cine
angiogram.


2.3  Stereo  by  3D  Search 

Ohta and Kanade [Ohta and Kanade 83] have been developing a
stereo algorithm to obtain an optimal matching surface in a three
dimensional search space. Their approach is purely computational.
When a pair of stereo images is rectified so that the epipolar lines are
horizontal scan lines, we can search for a pair of corresponding points in
right and left images within the same scan lines. We call this search
intra-scanline search. This intra-scanline search can be treated as the
problem of finding a matching path on a two dimensional search plane
whose axes are right and left scanlines. A dynamic programming
technique can efficiently handle this search [Baker 82]. The
intra-scanline search alone, however, does not take into account mutual
dependency between scanlines in an image; that is, inter-scanline search is
necessary to find the consistency across scan lines.
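
The intra-scanline search on the two dimensional plane can be illustrated with a small dynamic programming sketch. This is not the algorithm of the cited work: it matches raw intensities rather than edges, and the cost terms (absolute intensity difference for a match, a fixed `occlusion_cost` for an unmatched sample) are assumptions chosen only to show a monotonic matching path being found.

```python
def match_scanlines(left, right, occlusion_cost=1.0):
    """Find a minimum-cost monotonic matching path between two scanlines.

    Returns (total_cost, matches), where matches is a list of
    (left_index, right_index) pairs in increasing order.
    """
    n, m = len(left), len(right)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # match left[i-1] with right[j-1]
                c = cost[i - 1][j - 1] + abs(left[i - 1] - right[j - 1])
                if c < cost[i][j]:
                    cost[i][j], move[i][j] = c, "match"
            if i > 0:            # left sample has no partner
                c = cost[i - 1][j] + occlusion_cost
                if c < cost[i][j]:
                    cost[i][j], move[i][j] = c, "skip_left"
            if j > 0:            # right sample has no partner
                c = cost[i][j - 1] + occlusion_cost
                if c < cost[i][j]:
                    cost[i][j], move[i][j] = c, "skip_right"
    # Trace the path back from the far corner of the search plane.
    matches, i, j = [], n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "match":
            matches.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_left":
            i -= 1
        else:
            j -= 1
    matches.reverse()
    return cost[n][m], matches
```

Because the path can only move forward along both axes, the ordering of matched points is preserved on the two scanlines. Each scanline pair is solved independently here, so nothing enforces agreement between neighbouring scanlines.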

As shown in Figure 2, we cast the problem of stereo as that of finding
a matching surface (i.e., a set of matching paths) in a three dimensional
search space, which is a stack of the 2-D search planes and whose axes
are left-image x position, right-image x position and the scan line (y
position of the image). Vertically connected edges provide the consistency
constraints across the scan line axis. Thus, stereo involves two searches:
one is intra-scanline search for possible correspondence and the other is
inter-scanline search for consistency between connected edges. Ohta
and Kanade employ dynamic programming for both searches.

The matching is based on edges, and the positions of edges are
obtained as zero-crossings of the 1D Laplacian (taken along each scan
line) in both left and right images. The intra-scanline search locates
many partial paths for each pair of left and right scan lines, as candidates
for components of the final matching surface. The
inter-scanline search uses those partial paths as elements, and searches
for the combination of them which is most consistent with connected
edges. These two searches proceed simultaneously. The criteria (i.e., the
cost function) in the search involve a monotonicity assumption, the
similarity of intensity between edges, and surface smoothness.

Our main task domain is urban aerial photographs, but images in
other domains are also used to show the performance of our stereo algorithm.
Figure 3 is a typical example of aerial stereo images. Figure 4(a) shows
the disparity map obtained, and Figure 4(b) shows an isometric plot of
the depth map. Notice that the detailed structures of the roof of the
building and the bridge over the highway are clearly extracted. The
output of this stereo program will be used as another source of 3D
information in the Incremental 3D Mosaic system.


3.  MAPS 

MAPS is a large integrated image/map database system for photo
interpretation tasks. It contains high resolution aerial photographs,
digitized maps and other cartographic products, combined with detailed
3D descriptions of man-made and natural features in the Washington,
D.C. area [McKeown and Kanade 81] [McKeown and Denlinger 82].
In the September 1982 proceedings, McKeown [McKeown 82] reported
the addition of the concept map to facilitate inquiries at the symbolic
level. Since then, the concept map has been used to build a hierarchy
tree data structure which represents the whole-part relationships and
spatial containment of map feature descriptions [McKeown 83] (in this
volume). Unlike regular decomposition methods such as quad-tree
organizations, the hierarchical containment tree permits a hierarchical
search in the database based on natural relations among features which
are intrinsic to the conceptual map and may have some analogy with
how humans organize a "map in the head" to avoid search. Thus the
hierarchy tree improves the speed of spatial computations by quickly
constraining search to a portion of the database.
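
The way a containment hierarchy constrains search can be sketched as follows. The `Feature` class, the use of axis-aligned bounding boxes, and the feature names are assumptions made for illustration; the MAPS data structures themselves are described in the cited papers, not here.

```python
class Feature:
    """A map feature with a bounding box and the features it contains."""

    def __init__(self, name, bbox, children=()):
        self.name = name
        self.bbox = bbox              # (xmin, ymin, xmax, ymax)
        self.children = list(children)


def contains(bbox, x, y):
    xmin, ymin, xmax, ymax = bbox
    return xmin <= x <= xmax and ymin <= y <= ymax


def features_at(node, x, y):
    """List features containing (x, y), descending only into matching nodes."""
    if not contains(node.bbox, x, y):
        return []                     # the whole subtree is skipped
    found = [node.name]
    for child in node.children:
        found.extend(features_at(child, x, y))
    return found


# Hypothetical three-level hierarchy: city > districts > building.
building = Feature("building_1", (1, 1, 2, 2))
district_a = Feature("district_a", (0, 0, 5, 5), [building])
district_b = Feature("district_b", (6, 0, 10, 5))
city = Feature("city", (0, 0, 10, 10), [district_a, district_b])
```

A query inside `district_b` never examines `building_1`, because the subtree under `district_a` is rejected at its root. Unlike a quad-tree, the subdivision follows the features themselves rather than a regular grid.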

As an application of MAPS, McKeown has started investigation of
rule-based systems for the control of image processing and
interpretation with respect to a world model. The SPAM system
[McKeown and McDermott 83] is a system for testing the idea of using
the combination of task-independent low-level image processing tools, a
rule-based system and a map database expert.

4.  Systolic  Array  Processors  for  Vision 

Together with the VLSI group of CMU, we have started investigating
applications of systolic array processors made of PSCs (Programmable
Systolic Chips) [Fisher et al. 83] [Fisher et al. A 83] to image processing.

Example tasks we are considering include: smoothing, edge detection,
optical flow, iterative image registration, and matching by dynamic
programming. We expect one to three orders of magnitude
improvements in the speed of performing these image processing tasks
over conventional machines.
Abstract

The Image Understanding Project at Columbia has
centered its efforts on basic “middle-level” vision
research: the representations and algorithms concerned
with deriving surface information from low-level
aggregate cues. At present, the effort has four major
concerns: theory and analysis, integrated systems, image
research aids, and high-speed hardware. This report on
our first full year summarizes our progress in each of
these areas.

1  Introduction 

The  Image  Understanding  Project  at  Columbia  is 
new  and  small,  but  growing.  In  our  first  full  year,  we 
have  acquired  an  operating  laboratory,  and  defined  and 
attacked  our  research  concerns.  (Currently,  our 
experimental base consists of a VAX 750 with Grinnell
270, with CMU image and graphic software operating on
USC/IPI and other images. Additional hardware and
software  enhancements  are  planned.) 

Our  research  emphasis  is  on  that  level  of  image 
understanding  that  moderates  low-level  cues  into  surface 
information.  We  have  developed  several  new  algorithms 
that  make  some  of  these  transformations  possible,  and 
have  begun  to  quantify  their  accuracy.  Work  is  under 
way  to  integrate  several  of  these  surface-constraining 
algorithms  into  a  coherent,  distributed  system;  two 
separate  free-running  algorithms  have  been  executed  and 
are  being  refined.  Because  the  algorithms  and  their 
control are complex, we are implementing various graphic
ways  in  which  the  rich  intermediate  data  can  be 
represented  easily  to  the  experimenter.  Lastly,  we  have 
devised  and  simulated  some  low-level  vision  algorithms  for 
a  novel  supercomputer  being  independently  developed  at 
Columbia. 

2  Theory  and  Analysis 

Much  of  our  theoretical  work  concerns  the 
calculation  of  surface  orientation  constraints  from  low- 
level  image  cues.  One  representation  that  has  proven 
very  useful  for  this  and  other  tasks  is  the  gradient  space— 
independently  of  whether  the  image  is  taken  under 
orthographic  or  central  projection.  We  have  helped  to 
summarize  some  of  its  most  salient  properties  (especially 
those  under  projection)  in  a  type  of  researcher’s  reference 
card [Shafer 83; Shafer 82]. We have also highlighted
some of the difficulties that can occur under perspective;
algorithms known for their utility under orthography can
fail in unexpected ways [Kender 82a].

Many  of  the  algorithms  we  have  devised  for  our 
middle-level  work  are  derived  from  a  central 
methodological  paradigm  called  “shape  from  texture” 
[Kanade 83; Kender 82b]. We have now applied the
paradigm in two additional areas, deriving additional
surface constraint relations and procedures. (Versions of
these two papers appear in this proceedings.)


The  first  area  concerns  gravity,  which  induces 
certain  preferred  scene  orientations.  We  have  shown  how 
gravitationally-related  labels  such  as  “vertical”  can  be 
used  in  the  gradient  space,  and  how  such  knowledge  can 
generate additional constraints on surfaces [Kender 83a;
Kender 83b]. In particular, we have shown that sensor
parameters, surface parameters, and environmental labels
mutually interact so that knowledge of any two constrains
the third; further, often this knowledge can be
heuristically derived using Hough-like methods.

The  second  area  concerns  linear  extents:  image 
primitives  that  possess  measurable  length.  We  have 
shown  how  assumptions  of  equality  of  extent  provide 
surface constraints, sometimes in non-intuitive ways
[Kender 83c]. In particular, under orthography, lengths
behave  very  much  like  right  angles;  under  perspective, 
certain  configurations  induce  several  simple  iconic  (image 
plane)  geometric  constructions  for  vanishing  points. 

Lastly,  we  (David  Lee)  have  initiated  the  analysis  of 
the error behavior of a few of these algorithms. We
believe  that  a  fruitful  framework  is  that  of  the 
information-centered  approach  under  independent 
development  at  Columbia.  We  expect  to  be  able,  given  a 
desired  accuracy  of  surface  orientation,  to  derive  lower 
limits  on  the  resolution  necessary  in  the  image,  or  on  the 
confidences  necessary  in  the  image  primitive  array. 

3  Integrated  Systems 

We  (Mark  Moerdler)  have  started  work  on  the 
design  and  implementation  of  a  middle-level  vision  system 
that  integrates  knowledge  about  surfaces  from  multiple 
independent  sources.  Present  design  is  patterned  on  the 
blackboard  model  of  perceptive  systems.  Each  source 
derives surface information on the basis of one particular
shape  algorithm. 

Two such sources have been coded. Although
primitive and under refinement, their results are shown in
the figures following this report. Figure 1 shows a
synthetic image (“Manhattan Sunrise”) with two surfaces
sharing a common orientation; the lower surface is
composed of two textures. In Figure 2, an algorithm
based on equal extents, applied to the “waves”, generates
multiple vanishing points, very near the actual (but
invisible) vanishing line. In Figure 3, an algorithm based
on the detection of collinearities in random textures (Peter
Weseley), applied to the “sand”, generates a smear of
vanishing points that straddles the vanishing line. Since
vanishing lines map one-to-one into surface orientations,
these two algorithms implicitly calculate local slant and
tilt.
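
The mapping from a vanishing line to a surface orientation can be sketched under an assumed pinhole camera model. The function below takes two vanishing points of the same plane, which together fix its vanishing line. The focal length, the image coordinate convention (origin at the principal point), and the sign conventions for slant and tilt are assumptions of this example rather than details given in the report.

```python
import math


def surface_orientation(vp1, vp2, focal_length=1.0):
    """Return (slant, tilt) in degrees for the plane with these vanishing points.

    Each vanishing point (x, y) defines a viewing ray (x, y, f) that is
    parallel to the plane, so the plane normal is the cross product of the
    two rays.
    """
    x1, y1 = vp1
    x2, y2 = vp2
    f = focal_length
    a = y1 * f - f * y2
    b = f * x2 - x1 * f
    c = x1 * y2 - y1 * x2
    norm = math.sqrt(a * a + b * b + c * c)
    if norm == 0.0:
        raise ValueError("vanishing points must be distinct")
    if c < 0:                      # assumed convention: normal has c >= 0
        a, b, c = -a, -b, -c
    slant = math.degrees(math.acos(c / norm))   # angle from the optical axis
    tilt = math.degrees(math.atan2(b, a))       # direction of the normal in the image
    return slant, tilt
```

For example, the vanishing points (1, -1) and (-1, -1) with unit focal length lie on the horizontal line y = -1, and the recovered plane is slanted 45 degrees from the optical axis. A smear of vanishing points, as in Figure 3, would in practice call for fitting a line to many points rather than using just two.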

4  Image  Research  Aids 

One  problem  with  the  development  of  image 
understanding  systems  is  the  vast  amount  of  complex 
intermediate  data  that  they  produce.  In  particular,  the 
middle  levels  of  vision  are  replete  with  partial  assertions 
about  the  underlying  surfaces. 
Since surfaces have two parameters of orientation and one
parameter  of  depth,  and  since  each  image  point  may  have 
multiple  surface  hypotheses,  the  problem  of  observing  and 
understanding  an  executing  system  becomes  one  of 
human-compatible  graphic  economy. 


[Kender 82a] Kender, J. R. Why Perspective is
Difficult: How Two Algorithms Fail. Proceedings of the
National Conference on Artificial Intelligence, Aug. 1982,
pp. 9-12.


We  (laid  Douglas)  have  begun  research  into  the 
various  modalities  of  human  vision  that  can  be  exploited 
in  this  task.  Primarily,  we  are  constructing  a  surface 
synthesis  system  that  will  artificially  texture  (locally 
planar)  regions  of  an  image  in  ways  that  suggest  their 
orientations.  Additionally,  we  have  begun  to  explore  the 
ways  in  which  orientation  uncertainty  and/or  constraints 
can  be  graphically  displayed  by  means  of  icons,  motion,  or 
color.  Our  initial  icons  are  based  on  “sequins”  (circles 
seen  in  perspective). 


5  High-speed  Hardware 


Several parallel machine architectures have been
proposed that perform image understanding algorithms at
high speed. The NON-VON supercomputer being built at
Columbia is a tree structured one. Its primary processing
system consists of a very large number of very small
processing elements (PEs), each containing a small
amount of RAM and some hardware for performing
arithmetic and logical operations. The PEs are connected
together in the form of a complete binary tree. We
(Hussein Ibrahim) have found that this architecture lends
itself easily and naturally to the representation and
manipulation of binary images by quad trees.


A  binary  picture  at  its  finest  resolution  is  stored  in 
the  leaves  of  the  tree,  with  each  PE  holding  one  picture 
point.  Higher  levels  in  the  tree  represent  coarser 
resolutions;  building  the  quad  tree  can  be  done  in 
logarithmic  time.  Connected  components  can  be  found  in 
time  proportional  to  the  number  of  nodes  actually 
representing  regions  in  the  tree.  Several  other  algorithms 
for  region  properties  again  take  logarithmic  time.  These 
algorithms  have  all  been  tested  on  a  simulator.  We 
expect to develop the usual complement of image
processing  routines,  with  a  target  task  in  mind. 
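
The bottom-up construction described above can be sketched in software. The following is a minimal illustration, not the NON-VON implementation: the function name `build_quadtree_levels` and the three node labels are assumptions made for this example. Each pass of the loop corresponds to one step that a tree machine could carry out for all nodes of a level at once, so the number of passes grows with the logarithm of the image side.

```python
import numpy as np

# Hypothetical node labels for this sketch: uniform white, uniform black, mixed.
WHITE, BLACK, GRAY = 0, 1, 2

def build_quadtree_levels(image):
    """Return the list of quad tree levels, finest (the pixels) first.

    `image` is a square binary array whose side is a power of two.
    A parent is WHITE or BLACK only if all four children share that
    label; otherwise it is GRAY.
    """
    level = np.asarray(image, dtype=np.uint8)
    side = level.shape[0]
    assert level.shape == (side, side) and (side & (side - 1)) == 0
    levels = [level]
    while level.shape[0] > 1:
        h = level.shape[0] // 2
        # Gather the 2x2 block of children under each parent node.
        blocks = level.reshape(h, 2, h, 2).transpose(0, 2, 1, 3).reshape(h, h, 4)
        all_white = np.all(blocks == WHITE, axis=2)
        all_black = np.all(blocks == BLACK, axis=2)
        level = np.where(all_white, WHITE,
                         np.where(all_black, BLACK, GRAY)).astype(np.uint8)
        levels.append(level)
    return levels
```

For a 4 x 4 picture the function returns three levels (4 x 4, 2 x 2 and 1 x 1), that is, two reduction passes.
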
Our overall approach to the study of vision is based
on a number of representations of the visible world,
reviewed in previous Image Understanding Proceedings.
Our work to date has concentrated primarily on the
initial representations such as the primal sketch and
reflectance maps, and the computation from them of
depth, surface orientations, and material properties.
Our current emphasis is on the integration of the
different sources of information, the analysis and
representation of shape, the refinement and evaluation
of the individual modules, the extension of our approach
to deal with time varying images and moving objects,
and the transfer of our results to real time hardware
implementation. In this report we review our recent
work on the analysis of edge detection, the measurement
of visual motion, the correspondence problem, the
refinement and evaluation of stereo algorithms, the
detection of depth discontinuities, the integration of
surface maps, the interpretation of shape from
contours, and the acquisition of objects with photometric
stereo.

1.  Edge  detection  analysis 

Much of our work on edge detection, discussed in
previous Image Understanding Workshops, used the
zero-crossing contours in the image filtered through ∇²G filters
of different sizes. Any edge detector scheme to be used in
practical applications must show considerable robustness
and immunity to various types of noise. Continuing his work
aimed at developing a practical real time stereo-matching
system, Nishihara has recently examined the effect of
image noise on the ∇²G convolution and the zero-crossing
contours. In parallel with the effort of developing further
our standard edge detection techniques and improving their
reliability, we are also pursuing new approaches to the
edge detection problem. In particular, we are developing,
implementing and testing a new line finder. In another
investigation we are characterizing general properties of
edge detection schemes. We have also established some
results connecting the locations of zero-crossings with the
principal lines of curvature of a surface. We now review
each of these four topics in turn.
Noise Sensitivity of Zero-crossings

Distortions due to noise can be considered as perturbations
of the shapes of regions of constant sign in the
convolution  output.  Zero-crossing  patterns  are  generally 
stable  in  the  presence  of  low  to  moderate  image  noise  levels. 

The most common serious distortion of these patterns
for stereo matching occurs when two adjacent regions of
constant sign merge, or a single region splits, as a function
of noise introduced by the cameras or changing camera
position.

Only  a  small  number  of  pixels  need  change  sign  at 
strategic  locations  in  order  for  such  merges  and  divisions  to 
occur,  resulting  in  a  large  scale  change  of  the  zero-crossing 
geometry.  The  frequency  of  these  changes  is  low  in  a  high 
quality  image,  but  they  cannot  be  avoided  when  noise  is 
present and contrast is low, a ubiquitous phenomenon in
practical images. This distortion turns out, however, to be
strongly confined to specific spatial neighborhoods of the
image where the convolution magnitude is small. Outside
these neighborhoods, the convolution sign is constant and
stable, even for relatively large noise levels. The
sign-representation dual of the zero-crossing also promises to
yield more easily to a careful statistical analysis. Nishihara
is  investigating  ways  in  which  the  approach  can  be  used 
to  improve  noise  tolerance  in  stereo  matching  [Nishihara, 
1982,  1983]. 

Optimal  edge  detection  operators 

Canny [1983] has investigated the problem of deriving
an optimal edge detection operator from a precise
formulation of detection and localization [Binford 1981].
He finds that the optimal shape is (approximately) the
first derivative of a Gaussian. An important property of
an edge detector is that it should produce edge tokens
that are accurately located. It should also have a low
probability of misclassification of edges (i.e. it should produce
few erroneous edges and still be able to detect weak or
noisy edges). In particular, the operator should not produce
multiple responses to a single edge. The ability to correctly
classify potential edge points relates directly to the
signal-to-noise ratio of the output of the operator, which
is frequently used as the design criterion for an optimal
detector. The localizing ability of the edge detector is often
either ignored or only indirectly treated.

Canny’s  derivation  consists  of  three  steps.  First,  the 
design  is  constrained  to  linear  operators  only.  Second,  the 
optimal  linear  operators  are  combined  in  a  non-linear  way 
that  is  again  optimal  (or  near  optimal)  with  respect  to 
the  criteria  of  detection  and  localization.  Finally,  the  edge 
points  output  from  the  non-linear  detector  are  processed 
by  a  line-following  procedure  which  assigns  labels  to  the 
segments  of  contour  and  to  each  segment  a  set  of  parameters 
that  describe  the  type  of  edge  transition  (amplitude  of  the 
step,  uncertainty  in  amplitude,  uncertainty  in  position).  The 
resulting  operators  have  been  implemented  in  microcode 
on  a  LISP  machine,  and  form  the  basis  for  our  work 
on  smoothed  local  symmetries  and  shape  from  contour. 
The  operator  has  also  been  applied  to  textured  images  to 
generate  hierarchical  texture  descriptions. 

The  linear  operator  is  directly  optimized  with  respect 
to  both  signal-to-noise  ratio  and  localization.  Canny  shows 
that there is an uncertainty principle relating the two
quantities  and  that,  because  of  noise,  an  edge  cannot 
be  simultaneously  detected  and  localized  with  arbitrary 
precision.  There  is  a  unique  operator  shape  (approximately 
the  first  derivative  of  a  Gaussian)  that  attains  this  limit. 
The  width  of  the  operator  determines  the  tradeoff  in  output 
signal-to-noise  ratio  versus  localization.  A  narrow  operator 
gives  better  localization  but  poorer  signal  to  noise  ratio 
and  vice-versa.  To  handle  variations  in  the  signal  to  noise 
ratio  in  the  image,  operators  of  several  widths  are  used. 
Where  several  operators  respond  to  the  same  edge,  one  of 
them  is  selected  by  the  algorithm  so  as  to  give  the  best 
localization  while  preserving  an  acceptable  signal-to-noise 
ratio.  When  the  one  dimensional  formulation  is  extended 
to  two  dimensions,  the  same  criteria  of  optimality  are  used. 
This  leads  to  a  system  of  directional  operators,  with  their 
noise  estimation  and  edge  detection  all  being  performed 
independently. 
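
The tradeoff between signal-to-noise ratio and localization as the operator width changes can be illustrated numerically. The sketch below is an assumption-laden illustration rather than Canny's derivation: the report does not state the criteria as formulas, so the two measures used here (step response over noise gain, and slope at the centre over the noise gain of the differentiated operator) are assumed forms, and the function name `canny_style_criteria` is hypothetical. The operator is a sampled first derivative of a Gaussian, and white noise of unit density is assumed.

```python
import numpy as np

def canny_style_criteria(sigma, n_samples=4001, extent=6.0):
    """Return (snr, localization) for a first-derivative-of-Gaussian
    operator of width `sigma`, assuming a unit step edge and white noise.

    Both measures are assumed forms chosen for illustration.
    """
    x = np.linspace(-extent * sigma, extent * sigma, n_samples)
    dx = x[1] - x[0]
    g = np.exp(-x**2 / (2 * sigma**2))
    f = -x / sigma**2 * g                           # first derivative of a Gaussian
    f_prime = (x**2 / sigma**4 - 1 / sigma**2) * g  # derivative of the operator
    # Response at the centre of a unit step: integral of f over x <= 0.
    step_response = abs(np.sum(f[x <= 0]) * dx)
    snr = step_response / np.sqrt(np.sum(f**2) * dx)
    centre = n_samples // 2
    localization = abs(f_prime[centre]) / np.sqrt(np.sum(f_prime**2) * dx)
    return snr, localization

for width in (1.0, 2.0, 4.0):
    snr, loc = canny_style_criteria(width)
    print(f"sigma={width}: snr={snr:.3f} localization={loc:.3f} product={snr * loc:.3f}")
```

Under these assumptions the signal-to-noise measure grows with the operator width while the localization measure shrinks, and their product does not depend on the width, which is one way of reading the uncertainty relation described above.
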

The  automatic  switching  between  operators  requires 
local  estimation  of  the  noise  energy  in  the  operator  outputs. 
This  is  difficult  because  there  is  little  information  available 
at the operator outputs to indicate whether a response is due
to an edge or to noise. Canny has developed a scheme that
uses a model of an edge (in this case a step edge) to predict
the  response  of  each  operator.  He  then  removes  responses 
of  this  type  to  leave  the  response  due  to  noise  alone.  The 
noise  estimation  is  done  from  the  outputs  of  the  operators 
rather  than  directly  from  the  image,  because  detection  and 
localization  performance  is  determined  by  that  component 
of  the  image  noise  parallel  to  the  operator  direction,  and 
which  lies  within  the  bandwidth  of  the  operator.  Where 
image  noise  is  not  spectrally  flat,  and  in  particular  where 
there  is  fine  texture  (element  size  much  smaller  than  the 
operator  width),  the  texture  may  be  modelled  as  directional 
noise,  and  the  detector  will  still  be  able  to  respond  to  weak 
edges  in  directions  where  there  is  little  texture  energy. 

The  detector  is  being  evaluated  in  comparison  with 
several other well-known detectors, such as the
Marr-Hildreth Laplacian of Gaussian operator (1980) and the
second directional derivative detector of Haralick (1982).
Experiments are being performed using the operator as the
front end for the Marr-Poggio stereo algorithm (Grimson
1981a, b), as well as subjective evaluations of the detector
output on a variety of natural images, in particular on
images that contain boundaries between textured regions.
The multiplicity of operators enables the detector to locate
intensity changes that are occurring at different scales in
the image. The use of directional operators allows it to find
weak linear edges when the signal to noise ratio is very
poor. It is felt that linear edges form an important subclass
of intensity changes and that they occur often enough in
real  images  to  warrant  special  treatment.  The  traditional 
problems  with  highly  directional  operators  were  that  they 
tended  to  extend  the  boundaries  of  objects  beyond  corners 
and  gave  polygonal  responses  to  curved  surfaces.  These 
are  dealt  with  in  the  new  detector  by  the  addition  of 
applicability  constraints  for  each  directional  operator  based 
on  how  well  the  image  locally  approximates  a  linear  edge. 

The  detector  has  also  been  used  as  the  front  end  for 
two  hand-eye  vision  programs.  The  first  of  these  simply 
tracks  contours  drawn  on  some  surface.  The  second  takes 
the  raw  edges  that  mark  the  boundaries  of  objects  and 
produces  bounding  polyhedra  of  minimum  additional  area. 
The  latter  will  be  used  in  conjunction  with  automatic  path 
planning  programs. 

Parallel to Canny’s development of an optimal edge
detector, Poggio and Torre have begun an investigation,
presently in progress, of edge detection by dividing the
problem into two main steps: a derivative operation and
a filtering operation to reduce the noise. Each of these
steps can be characterized in general terms. If the detection
of edges is based on detection of extrema in the output
of the filter, then directional derivatives should be used
in connection with directional odd filter functions. If edge
detection is to be performed via zero-crossing detection, then
rotationally symmetric differential operators must be used
together with symmetric filter functions. If the differential
operator is linear, the two steps of differentiation and filtering
commute and associate, with interesting implications for
fast hardware. For nonlinear differential operators the two
operations in general must be performed separately, and
furthermore their order is important. Poggio and Torre have
examined in particular two rotationally symmetric operators:
the second directional derivative along the gradient (a
non-linear operator) and the Laplacian (a linear operator).
It is easy to show that there are edges that escape detection
by the Laplacian but not by the second derivative along the
gradient. Furthermore, the zero-crossings of the Laplacian
coincide with the zero-crossings of the second directional
derivative along the gradient if, and only if, the mean
curvature of the intensity function is locally zero.
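
The difference between the two operators can be checked numerically. The sketch below is an illustration under stated assumptions, not the analysis of Poggio and Torre: it uses plain finite differences with no smoothing filter, the function name is hypothetical, and the second derivative along the gradient is computed from the usual expression (Ix² Ixx + 2 Ix Iy Ixy + Iy² Iyy) / (Ix² + Iy²).

```python
import numpy as np

def laplacian_and_gradient_second_derivative(image, spacing=1.0):
    """Return (Laplacian, second directional derivative along the gradient)
    of a 2-D intensity array, using central finite differences."""
    Iy, Ix = np.gradient(image, spacing)      # derivatives along rows (y) and columns (x)
    Iyy, _ = np.gradient(Iy, spacing)
    Ixy, Ixx = np.gradient(Ix, spacing)
    laplacian = Ixx + Iyy
    grad_sq = Ix**2 + Iy**2
    numerator = Ix**2 * Ixx + 2 * Ix * Iy * Ixy + Iy**2 * Iyy
    # The directional derivative is undefined where the gradient vanishes;
    # this sketch simply returns 0 there.
    along_gradient = np.divide(numerator, grad_sq,
                               out=np.zeros_like(numerator),
                               where=grad_sq > 1e-12)
    return laplacian, along_gradient

y, x = np.mgrid[-10:11, -10:11].astype(float)
ramp = np.tanh(x / 3.0)   # straight edge: intensity varies along x only
bowl = x**2 + y**2        # curved level lines
lap_ramp, sdg_ramp = laplacian_and_gradient_second_derivative(ramp)
lap_bowl, sdg_bowl = laplacian_and_gradient_second_derivative(bowl)
```

For the straight edge the two outputs agree everywhere, so their zero-crossings coincide. For the bowl, away from the array border and the centre, the Laplacian is 4 while the second derivative along the gradient is 2, which shows that the two operators differ once the level lines of the intensity function are curved.
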

Three  classes  of  filters  have  been  analyzed  in  detail: 
bandlimited,  support  limited  and  filters  with  minimal 
uncertainty  in  space  and  frequency.  The  filters  of  the  first 
class can be synthesized in terms of linear and circular
prolate functions; in the second class, Haar functions are
the most interesting basis for optimal filters; the third
class leads to the study of Hermite functions. Poggio and
Torre derive formulae for computing the uncertainty of an
arbitrary filter using its decomposition in Hermite functions.
They also observe that a filter of minimal uncertainty
combines maximum localization in space with a minimum
number of zeros in its output to Gaussian white noise.
In particular, the second derivative along the gradient,
successively smoothed by a circularly symmetric Gaussian
filter, is a near-optimal scheme in terms of these criteria.

In a separate investigation, we report on a 2-D version of
Logan’s theorem, which gives sufficient conditions for the
completeness of the zero-crossing representation in the case
of directional bandpass filters [Poggio et al., 1982].

Lines of curvature and zero-crossings


In recent years, workers in vision have shown considerable
interest in the principal lines of curvature of surfaces.
For example, curvature patches have been proposed as
a representation for visible surfaces [Brady 1983] and there
exist various schemes for dividing objects into parts based
on extrema and zeros of curvature [Brady 1983, Hollerbach
1975]. There is also some evidence from line drawings
[Stevens 1981] that curves in an image are interpreted as
lines of curvature. However, it has been suggested that the
principal lines of curvature of a surface can only be computed
indirectly and with great difficulty. The complexity
of the calculations also implies poor numerical behaviour
and excessive sensitivity to noise.

Yuille  [1983]  proves  some  results  about  zero  crossings 
and  the  principal  lines  of  curvature  of  a  surface.  He  relates 
the image to the underlying surface geometry by the image
irradiance equation [Horn 1977] and suggests that the
principal lines of curvature can be computed directly from
the  image. 

Various directional zero crossing operators are considered.
It is shown that directional zero crossings do not
necessarily correspond to physical zero crossings (i.e., those
that correspond to sharp changes in the image irradiance).
A result is derived that implies that directional zero crossings
are physical only if their direction is along the line of
greatest change of the image irradiance. Such directional
operators have been argued for by Canny [1983] and Poggio
and Torre [see Poggio, 1982, 1983]. Conversely, a probabilistic
argument shows that the directions of greatest change
of the image irradiance are most likely to be along the
lines of principal curvature. This suggests that many, if
not most, of the physical zero crossings are directional zero
crossings along the principal lines of curvature.

Finally, Yuille proves some results about the distribution
of zero crossings along lines of curvature. The starting
point is the work of Grimson on surface consistency
[Grimson 1981b]. With relatively weak assumptions about
the reflectance function, Grimson derived necessary and
sufficient conditions in one dimension for the occurrence of
directional zero crossings in the image irradiance in terms
of the surface geometry. He then used some probabilistic
assumptions about the reflectance surface to extend this
result to two dimensions and prove the Surface Consistency
Theorem. This theorem was the basis for his theory of
surface interpolation.

Yuille shows, without any probabilistic assumptions,
that Grimson’s result can be generalized to give necessary
and sufficient conditions for the occurrence of directional
zero crossings along the principal lines of curvature. We call
this result the Line of Curvature Theorem. It suggests
that many, if not most, of the physical zero crossings can
be associated with points on the lines of principal curvature
which are near the extrema of the principal curvatures.
This  supports  the  view  that  lines  of  principal  curvature  can 
be  computed  directly  from  the  image.  In  turn  it  supports 
the  curvature  patch  representation. 

2.  The  computation  of  visual  motion 

In the area of visual motion analysis, Hildreth and
Ullman have explored a zero-crossing based approach to
the computation of the two-dimensional velocity field from
the changing image [Hildreth & Ullman, 1982; Hildreth,
1982, 1983; Ullman & Hildreth, 1983]. The starting point
was the work of Marr and Ullman (1981), in which the
initial detection of motion takes place at the location of
zero-crossings in the output of the convolution of the image
with a ∇²G operator. The main computational reason for
restricting initial motion measurements to the zero-crossings
is that they correspond to locations in the image for which
the gradient of intensity is locally maximum, and hence
yield the most reliable motion measurements [Hildreth, in
press]. Hildreth and Ullman have extended the work of
Marr and Ullman to allow for the computation of the
projected two-dimensional velocity field that results from
the general motion of three-dimensional surfaces in space.

Due  to  the  aperture  problem,  local  measurements 
of  movement  in  the  changing  image  only  provide  the 
component  of  velocity  in  the  direction  perpendicular  to  the 
local  orientation  of  a  zero-crossing  contour.  In  particular, 
let  V(s)  denote  the  velocity  field  along  a  contour  (s  denotes 
arclength).  V(s)  can  be  decomposed  into  components 
perpendicular  and  tangent  to  the  curve: 

$$V(s) = v^{\perp}(s)\,u^{\perp}(s) + v^{T}(s)\,u^{T}(s)$$

$u^{\perp}(s)$ and $u^{T}(s)$ are unit direction vectors perpendicular
and tangent to the contour, and $v^{\perp}(s)$ and $v^{T}(s)$ are the
magnitudes of the two velocity components. The first term
in the above expression can be measured directly from the
changing  image.  The  second  term  cannot,  and  must  be 
recovered  to  compute  the  velocity  field  V(s). 

The  main  theoretical  problem  for  this  recovery  is  that 
V(s)  is  not  specified  uniquely  by  information  available  in 
the  changing  image.  Additional  constraint  is  required  to 
compute  a  unique  velocity  field.  Drawing  from  the  work  of 
Horn and Schunck (1981) on the optical flow computation,
we  use  an  additional  constraint  of  smoothness  of  the  velocity 
field.  Physical  surfaces  are  generally  smooth,  compared  with 
their  distance  from  the  viewer;  under  motion,  they  usually 
generate  smoothly  varying  velocity  fields.  To  compute  a 
single  velocity  field,  we  find  the  velocity  field  which  is 
consistent  with  the  changing  image,  and  varies  the  least. 

Through a mathematical analysis, it was found that the
above smoothness constraint can be formulated in such a
way that a unique velocity field solution is guaranteed. In
particular, the local change in V(s) is given by $\partial V / \partial s$; a scalar
measure of this change is given by its magnitude, $|\partial V / \partial s|$.
The total variation of velocity over an entire contour can be
obtained by integrating this local measure over the curve.
The velocity field computation then seeks the velocity field
that is consistent with the changing image, and minimizes
total variation in velocity along contours. It can be shown
analytically that there exists a unique velocity field that is
consistent with the measurements of $v^{\perp}(s)$ obtained from
the image, and that minimizes the particular measure of
total variation given by: $\int |\partial V / \partial s|^2 \, ds$.

There are two classes of motion for which the velocity field of
least  variation  is  the  correct  physical  velocity  field,  assuming 
orthographic  projection  of  the  scene  onto  the  image. 
The  first  consists  of  arbitrary  rigid  objects  undergoing 
pure  translation.  The  second  consists  of  three-dimensional 
objects,  whose  edges  are  straight  lines,  undergoing  rigid 
rotation  and  translation  in  space.  For  the  class  of  smooth 
curves  in  rotation,  the  velocity  field  of  least  variation  is,  in 
general,  not  the  physically  correct  one.  However,  it  is  often 
qualitatively  similar.  For  examples  in  which  the  true  and 
smoothest  velocity  fields  differ  significantly,  it  appears  that 
the  smoothest  velocity  field  may  be  more  consistent  with 
human motion perception.

The  velocity  field  computation  has  been  implemented, 
using  a  standard  iterative  algorithm  from  mathematical 
programming,  known  as  the  conjugate  gradient  algorithm. 
If there are n parameters to compute (in our case, the x and
y  components  of  velocity),  this  algorithm  is  guaranteed  to 
converge  to  the  final  solution  in  at  most  n  steps.  The  method 
has  been  applied  to  a  number  of  images.  Qualitatively,  it 
appears  to  give  good  results  for  unrestricted  motion.  We 
plan  to  evaluate  the  method  further  on  both  synthetic  and 
natural  images  in  the  near  future. 

To  summarize,  the  computation  of  the  two-dimensional 
velocity  field  consists  of  two  main  steps:  (1)  initial  motion 
measurements  are  obtained  along  zero-crossing  contours, 
and  provide  the  component  of  velocity  perpendicular  to  the 
contour,  and  (2)  motion  measurements  are  then  integrated 
along  the  contours,  to  compute  the  two-dimensional  velocity 
field  V(s)  that  minimizes  total  variation,  given  by  the 
measure: $\int |\partial V / \partial s|^2 \, ds$. Formulated in this way, a projected
two-dimensional velocity field can be computed for rigid
and non-rigid surfaces undergoing general motion in space.
The computation can be implemented with standard
optimization algorithms. Computational experiments support
the  feasibility  of  this  approach  to  motion  measurement. 
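
The two steps summarized above can be sketched for a single closed contour. This is a minimal illustration under several assumptions and is not the implementation described in this report: the contour is sampled uniformly, the integral is replaced by a sum of squared differences between neighbouring velocity samples, agreement with the measured perpendicular components is enforced by a quadratic penalty with a hypothetical weight `beta`, and the function names are invented for this example. The resulting linear system is solved with a small conjugate gradient routine.

```python
import numpy as np

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive definite operator `apply_A`."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter or 10 * b.size):
        if np.sqrt(rs) < tol:
            break
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def smoothest_velocity_field(normals, v_perp, beta=10.0):
    """Estimate velocities V_i along a closed contour from the measured
    perpendicular components v_perp_i = V_i . n_i by minimizing
        sum_i |V_{i+1} - V_i|^2 + beta * sum_i (V_i . n_i - v_perp_i)^2.
    `normals` has shape (N, 2) and holds unit normals; `beta` is an
    assumed penalty weight.
    """
    n_pts = normals.shape[0]

    def apply_A(flat):
        V = flat.reshape(n_pts, 2)
        # Smoothness term: discrete second difference around the closed contour.
        smooth = 2 * V - np.roll(V, 1, axis=0) - np.roll(V, -1, axis=0)
        # Data term: penalize disagreement with the measured normal component.
        data = beta * np.sum(V * normals, axis=1)[:, None] * normals
        return (smooth + data).ravel()

    b = (beta * v_perp[:, None] * normals).ravel()
    return conjugate_gradient(apply_A, b).reshape(n_pts, 2)

# Usage: an ellipse translating with velocity (1.0, 0.5).
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
normals = np.stack([1.0 * np.cos(theta), 2.0 * np.sin(theta)], axis=1)
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
true_velocity = np.array([1.0, 0.5])
v_perp = normals @ true_velocity
V = smoothest_velocity_field(normals, v_perp)
```

For this purely translating contour the recovered field is constant and matches the true translation, in line with the first class of motions for which the field of least variation is the correct one.
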

3. The correspondence problem

A very general approach to the correspondence problem
in either stereo or motion consists of taking a large
set of local measurements for each pixel of the image
and matching the most similar sets between the two
images. These measurements can be regarded as nonlinear
functionals representing the “primitives” on which the
matching is performed. Matching constraints, dictated
by the specific problem, may easily ensure uniqueness of
matching. Although a large set of primitives may appear
rather cumbersome and difficult to compute, massive
parallel processing, which begins to be feasible with the
new solid state technologies, makes a scheme of this type
quite attractive. Furthermore, the resulting specificity of
matching primitives may avoid the extended use of complex
constraints which are more difficult to implement in a highly
concurrent system.

The  main  problem  is  the  choice  of  the  appropriate 
class  of  functionals.  Poggio  has  considered  the  abstract 
computational  properties  of  a  specific  class  of  nonlinear 
functionals, i.e., polynomial functionals [Poggio, 1983].
For the correspondence problem in ideal noise-free and
distortion-free images, a complete set of linear functionals
can be proved to be sufficient: nonlinear functionals cannot
improve the matching (since linear functionals separate
points in a Banach space). In practice, however, the number
of measurements is finite and actually relatively small; under
these conditions nonlinear operators might represent more
compactly the relevant information. For instance,
zero-crossing maps of ∇²G convolved images can be considered
as  the  output  of  a  quadratic  functional  operating  on  the 
image  with  support  equal  to  the  underlying  Gaussian. 
Kass  and  Poggio  are  presently  exploring  correspondence 
schemes  based  on  sets  of  nonlinear  functionals.  This  effort 
is motivated by a recent algorithm developed by Kass to
solve  the  correspondence  problem  and  based  on  a  large 
set  of  linear  functionals.  The  algorithm  is  based  on  the 
paradigm  of  combining  independent  measurements.  The 
underlying  idea  is  that  if  a  dozen  or  so  independent 
indications  of  correspondence  can  be  combined,  then  no 
single  measurement  need  be  dependable  in  order  for  the 
combination  to  be  quite  reliable.  A  set  of  nearly  independent 
linear  filters  based  on  first  and  second  derivatives  of  Gaussian 
smoothed  images  was  used  by  Kass.  He  was  able  to  show 
that  a  particular  computation  based  on  these  measurements 
can  reliably  determine  correspondence  for  textured  images 
with  signal  to  noise  ratios  of  two  or  more.  An  algorithm 
performing  this  computation  has  been  applied  to  a  few 
natural  images  with  encouraging  results.  The  algorithm 
and its implementation are discussed in detail in these
Proceedings  [Kass,  1983]. 
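
The idea of matching on a set of linear measurements can be sketched in one dimension. The code below is a hypothetical illustration and not the algorithm of Kass: the filter scales, the squared-distance comparison, the circular boundary handling and all function names are assumptions made for this example. Each pixel is described by the responses of first and second derivative of Gaussian filters at several scales, and is matched to the position in the other signal whose response vector is most similar.

```python
import numpy as np

def gaussian_derivative_kernels(sigmas=(1.0, 2.0, 4.0), half_width=16):
    """First and second derivatives of Gaussians at a few assumed scales."""
    x = np.arange(-half_width, half_width + 1, dtype=float)
    kernels = []
    for s in sigmas:
        g = np.exp(-x**2 / (2 * s**2))
        g /= g.sum()
        kernels.append(-x / s**2 * g)                  # first derivative
        kernels.append((x**2 / s**4 - 1 / s**2) * g)   # second derivative
    return kernels

def circular_filter(signal, kernel):
    """Circular convolution via the FFT (wrap-around borders are assumed)."""
    n = signal.size
    padded = np.zeros(n)
    half = kernel.size // 2
    for k, w in enumerate(kernel):
        padded[(k - half) % n] += w
    return np.real(np.fft.ifft(np.fft.fft(signal) * np.fft.fft(padded)))

def feature_vectors(signal, kernels):
    """One row of filter responses per pixel."""
    return np.stack([circular_filter(signal, k) for k in kernels], axis=1)

def match_disparities(left, right, max_disparity=8):
    """For each pixel x of `left`, return the shift d minimizing the squared
    distance between the features of left[x] and right[x + d]."""
    kernels = gaussian_derivative_kernels()
    f_left = feature_vectors(left, kernels)
    f_right = feature_vectors(right, kernels)
    candidates = np.arange(-max_disparity, max_disparity + 1)
    cost = np.stack(
        [np.sum((f_left - np.roll(f_right, -d, axis=0))**2, axis=1)
         for d in candidates],
        axis=1,
    )
    return candidates[np.argmin(cost, axis=1)]

# Usage: a random texture and a copy displaced by 5 pixels.
rng = np.random.default_rng(0)
left = rng.standard_normal(256)
right = np.roll(left, 5)
disparities = match_disparities(left, right)
```

In this noise-free setting every pixel is assigned the displacement of 5. Combining several responses makes the comparison more specific than any single filter output; behaviour under noise is not examined by this sketch.
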

4. Refinements and evaluation of stereo

In  previous  IU  reports,  we  have  described  the  theory 
and  implementation  of  Marr  and  Poggio’s  theory  of  human 
stereo  [Marr  and  Poggio,  1979;  Grimson  and  Marr,  1979; 
Grimson  1980,  1981a,  1981b].  The  input  to  the  stereo 
matcher is obtained by convolving the left and right
images with a number of Difference-of-Gaussian filters and
locating the zero-crossings in each such convolution. The
matching proceeds in a coarse to fine manner, finding
zero-crossings  of  the  same  contrast  sign  and  roughly  the 
same  image  orientation,  within  a  predetermined  range  along 
horizontal  slices  of  the  rectified  images,  based  on  the  general 
distribution  of  zero-crossings.  As  a  consequence  of  testing 
the  algorithm  on  a  wide  range  of  natural  images,  a  number  of 
modifications  to  the  published  algorithm  have  been  made. 
First,  the  matching  of  zero-crossing  points  independent 
of their local context may lead to isolated incorrect
matches.  In  the  original  published  algorithm,  a  continuity 
constraint  is  applied  using  statistical  measurements  over 
areas  of  the  image.  While  this  was  demonstrated  to 
be  sufficient  on  a  range  of  test  images,  it  occasionally 
led  to  incorrect  matches  near  surface  discontinuities  or 
occlusions.  Similar  to  the  work  of  Mayhew  and  Frisby 
[1981] and Baker and Binford [1981], we have developed
a  continuity  constraint  that  checks  for  consistency  along 
zero-crossing  contours  that  typically  correspond  to  a  single 
physical  edge.  This  constraint  implicitly  incorporates  the 
zero-crossing  orientation  constraint,  and  may  be  considered 
as  being  equivalent  to  matching  a  zero-crossing  contour 
from  one  image  against  an  envelope  about  a  contour 
in  the  other  image.  Second,  we  have  also  investigated 
the  sensitivity  of  the  algorithm  to  vertical  disparity  and 
other image distortions. We have found that there is a
tradeoff between the resolution of disparity information
computed  by  the  algorithm  and  the  sensitivity  of  the 
algorithm  to  vertical  disparity.  Computational  experiments 
on  aerial  photographs  have  led  us  to  redefine  the  matching 
algorithm  to  match  zero-crossings  from  a  line  in  one 
image  to  zero-crossings  lying  within  2  or  3  lines  of 
the  corresponding  line  in  the  second  image,  reducing 
the  resolution  of  the  available  disparity  information,  but 
enabling  the  algorithm  to  match  rectified  images  containing 
small  residual  amounts  of  vertical  disparity.  As  in  the 
original  algorithm,  vertical  disparities  beyond  this  range 
are  handled  by  explicitly  changing  the  vertical  alignment  of 
the  images.  Interestingly,  psychophysical  data  suggest  that 
human  stereopsis  relies  on  a  registration  process  mediated  by 
appropriate  eye  movements,  to  correct  for  vertical  disparities 
larger  than  about  4’-7’  (Nielsen  and  Poggio,  forthcoming). 
We  are  presently  exploring  in  a  computational  analysis 
the  properties  of  the  registration  process  with  the  goal  of 
implementing  this  stage  as  an  integral  part  of  our  stereo 
algorithm. In a separate investigation, Nishihara and Poggio
[1982] have found additional support for the matching
primitives used in our stereo algorithms. They have shown
that the sign of the convolved images, or equivalently
the zero-crossings, contains sufficient information for the
matcher  to  operate  successfully  even  in  random-line  stereo 
pairs  invented  by  Julesz  and  Spivack  and  claimed  to  require 
the  computation  of  vernier  cues. 

The main emphasis of work on the Grimson implementation
of the Marr-Poggio theory in the past year has
been  in  applying  the  algorithm  to  aerial  photography.  The 
images  tested  have  contained  a  variety  of  scenes.  Included 
in  these  are  two  stereo  pairs  of  sections  of  the  University  of 
British  Columbia,  provided  by  the  Faculty  of  Forestry.  One 
is  of  a  combination  of  apartment  complexes  and  natural 
terrain,  (including  several  hundred  foot  high  Douglas  firs). 
The  second  is  of  a  hospital  complex,  with  a  variety  of 
different  sized  buildings.  The  third  pair,  supplied  by  Boeing 
Corporation,  is  of  a  complex  highway  intersection.  The 
fourth  pair,  supplied  by  the  Defense  Mapping  Agency,  is  of 
natural  terrain,  as  is  the  fifth  pair,  supplied  by  the  Army 
Engineering  Topographic  Labs.  The  sixth  pair,  supplied 
by Stanford University, consists of the CDC synthetic images of a
building complex. An informal evaluation of the results is
currently underway in conjunction with ETL.

The  performance  of  the  matching  algorithm  can 
be  evaluated  on  two  grounds,  matching  efficiency  and 
disparity  localization.  Matching  efficiency  refers  to  the 
actual correspondence process applied to the zero-crossing
contours. While the specific numbers clearly depend on
the particular structure of the images, for these types
of images we typically find that on the order of 75 to
80 percent of the available zero-crossings are assigned a
correspondence (and that this usually represents on the
order of 10 percent of the image for normal sized DOG
filters). Of these matched zero-crossings, usually on the
order of 99.5 percent are correct, in that they are
matched  to  the  correct  zero-crossing  contour  in  the  second 
image.  Disparity  localization  refers  to  the  accuracy  of  the 
disparity  values  associated  with  a  match,  a  value  that  is  a 
function  of  the  localization  accuracy  of  the  Marr-Hildreth 
edge  detector  as  well  as  of  the  matching  process  itself.  An 
evaluation  of  the  localization  accuracy  of  the  algorithm  on 
these  images  is  currently  underway  jointly  with  ETL. 

A different algorithm, which also represents an evolution
of the original stereo theory, has been developed
by Nishihara with the goal of perfecting a high speed,
noise tolerant stereo matcher. Specifically, we are studying
techniques  for  minimizing  a  matcher's  sensitivity  to  such 
distortions  in  noisy  signals  as  might  occur  in  low  contrast 
images  and  in  applications  where  lower  quality  cameras 
are  used.  Nishihara  has  found  that  noise  sensitivity  can  be 
reduced  significantly  by  trading  off  resolution  for  reliability 
in much the same way that Marr and Poggio (1979) originally
proposed  trading  off  resolution  for  disparity  range. 

He has implemented a prototype matcher using these
results  on  top  of  the  realtime  convolution  hardware  he 
developed  earlier  with  N.  Larson  (Nishihara  &  Larson, 
1981).  The  system  currently  produces  a  16  by  16  array 
of  depth  measurements  every  15  seconds  from  vidicon 
camera images having noise levels on the order of 10-20 percent. The
matching volume of the device is approximately a cube
with depth resolution somewhat better than its present 16
by 16 spatial resolution. Conversion to microcode from Lisp
should  allow  a  doubling  of  the  resolution  obtained  while 
maintaining  or  reducing  the  matching  time. 

5.  Integrating  surface  maps 

Computational  vision  requires  the  construction  of  rich 
descriptions of surface shape. Marr and Nishihara’s 2½-D
sketch [Marr, 1982], a viewer-centered description of the
visible surfaces in a scene, is an important intermediate
representation on the road to surface analysis and, ultimately,
to object recognition.

In  previous  reports  we  have  described  work  by  Grimson 
and  Terzopoulos  on  the  interpolation  of  shape  information 
in locations where it is not specified exactly by the image.
In addition to extensions of the surface interpolation
theory and the problem of computational efficiency, our
recent effort in the recovery and representation of surface
information has concentrated on the problem of integrating
information about surface shape from different sources. This
section summarizes the work by Terzopoulos and by
Grimson in this area. The following section describes our
research  in  a  related  area  -  the  problem  of  detecting  and 
dealing  with  discontinuities. 

Current work by Terzopoulos has examined four problems
in the visual analysis of surfaces: (i) the
constraint integration problem; (ii) the discontinuity
problem; (iii) the interpolation problem; and (iv) the
computational efficiency problem. Some of the work
on interpolation of smooth surfaces from raw, scattered
constraints on surface shape [Grimson, 1981a, b; Brady and
Horn, 1983; Terzopoulos, 1982], and investigations into
computational efficiency, which came to fruition in the
development of an extremely efficient multilevel surface
reconstruction algorithm [Terzopoulos, 1982, 1983], have
been described in previous reports. We shall therefore
concentrate here on recent advances in our study of the
constraint integration problem.


Integrating  Constraints  from  Several  Visual  Sources 


Each  visual  modality  constitutes  a  distinct  source  of 
partial  information  constraining  surface  shape.  Processes 
such  as  stereopsis  and  analysis  of  motion  naturally  generate 
local  depth  constraints,  while  processes  such  as  shape 
from  shading,  texture,  and  contours  naturally  provide 
local  surface  orientation  constraints.  Surface  reconstruction 
necessitates  the  integration,  over  several  sources,  of  these 
two  classes  of  scattered  constraints. 

Surface  reconstruction  was  formulated  in  terms  of  a 
physical  model  —  a  variational  problem  describing  the 
equilibrium  of  a  thin,  flexible  plate  subject  to  constraints. 
It  involves  the  following  plate  energy  functional: 

$$\mathcal{E}_p(v) = \iint_\Omega \tfrac{1}{2}(\Delta v)^2 - (1 - \sigma)(v_{xx} v_{yy} - v_{xy}^2) \, dx \, dy.$$

In the generalized formulation, the influence of various
constraints on the plate interpolating surface is governed
by additive penalty functionals [Terzopoulos, 1983b]. Depth
constraints are handled by the functional

$$\mathcal{E}_d(v) = \frac{1}{2} \sum_i \beta_{(x_i, y_i)} \left[ v(x_i, y_i) - d_{(x_i, y_i)} \right]^2,$$

while orientation constraints are handled by an analogous
penalty functional $\mathcal{E}_o(v)$ on the surface gradient,
so that the total energy functional to be minimized (over
an appropriate Sobolev space of admissible functions) is
$\mathcal{E}(v) = \mathcal{E}_p(v) + \mathcal{E}_d(v) + \mathcal{E}_o(v)$. The reconstructed surface
is the minimizing function $v = u(x, y)$ representing a thin
plate  surface  at  equilibrium,  subject  to  the  influence  of 
either  scattered  depth  constraints,  or  scattered  orientation 
constraints,  or  both.  In  this  way,  all  available  constraints 
generated  by  various  sources  are  employed  as  an  integrated 
whole,  and  the  reconstructed  surface  is  the  best  possible  in 
view  of  the  available  information. 
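
A discrete version of this minimization can be sketched with finite differences. The sketch below is an illustration under stated assumptions, not the multilevel finite element algorithm of Terzopoulos: it takes the standard thin-plate form with $\sigma = 0$, for which the integrand reduces to $\tfrac{1}{2}(v_{xx}^2 + 2 v_{xy}^2 + v_{yy}^2)$, handles depth constraints only, and solves the resulting linear system directly. The function names and the single weight `beta` are hypothetical.

```python
import numpy as np

def second_diff(n):
    """(n-2) x n matrix of 1-D second differences."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def first_diff(n):
    """(n-1) x n matrix of 1-D forward differences."""
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i:i + 2] = [-1.0, 1.0]
    return D

def reconstruct(shape, depth_points, beta=10.0):
    """Minimize discrete plate energy plus depth penalties.

    depth_points maps (row, col) -> depth. At least three non-collinear
    constraints are needed for the system to be nonsingular.
    """
    ny, nx = shape
    # The surface is flattened row-major: index = row * nx + col.
    Dxx = np.kron(np.eye(ny), second_diff(nx))
    Dyy = np.kron(second_diff(ny), np.eye(nx))
    Dxy = np.kron(first_diff(ny), first_diff(nx))
    A = Dxx.T @ Dxx + Dyy.T @ Dyy + 2.0 * Dxy.T @ Dxy
    b = np.zeros(ny * nx)
    for (row, col), d in depth_points.items():
        k = row * nx + col
        A[k, k] += beta        # gradient of (beta/2) * (v_k - d)^2
        b[k] += beta * d
    return np.linalg.solve(A, b).reshape(ny, nx)
```

Constraints sampled from a plane are reproduced exactly, since a plane has zero bending energy. Orientation constraints could be added in the same way by penalizing first differences at the constrained points.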


To summarize, Terzopoulos’ work in surface reconstruction,
as described in the last report, has been successfully
generalized to deal with the constraint integration problem.
He is currently refining and testing his computational theory
of visible-surface representations, aiming toward a more
complete understanding of the structure of the 2½-D sketch.
In addition, he is exploring the applicability of techniques
that have proven to be valuable in surface reconstruction,
such as the finite element method and multilevel relaxation
methods, to other problems in low- and intermediate-level
vision, including lightness, shape from shading, and optical
flow. Preliminary results are encouraging.

Combining  stereo  and  shape-from-shading 

Previous  reports  have  described  our  work  on  surface 
reconstruction,  mostly  based  on  constructing  complete 
surface  representations,  consistent  with  the  image  irradiance 
information,  from  stereo  depth  data.  While  acceptable 
surface  reconstructions  can  be  obtained  strictly  from  depth 
information,  it  is  clear  that  additional  boundary  constraints 
would  lead  to  more  accurate  surface  representations.  In 
order  to  seek  such  additional  boundary  information,  we 
have  investigated  the  mathematical  relationship  between 
the  Marr-Poggio  theory  of  stereo  and  Horn’s  work  on 
shape from shading. Grimson [1982b] has shown that if the
reflectance map [Horn and Sjoberg 1979] is known, then
given  a  pair  of  stereo  matched  depth  contours  it  is  possible  to 
determine  the  surface  normal  along  the  depth  contour.  The 
proof  suggests  a  technique  for  finding  surface  normals  that 
is  essentially  analogous  to  photometric  stereo,  pioneered 
by  Horn,  Woodham,  and  Silver  [1978].  Conversely,  it  is 
possible  in  principle  to  determine  certain  visible  surface 
characteristics  from  stereo  information.  Suppose  that  the 
reflectance  map  is  of  the  form 


$$R(\hat{n}) = \rho \left[ (1 - \alpha)(\hat{n} \cdot \hat{s}) + \alpha (\hat{n} \cdot \ldots)^{k} \right]$$

where $\rho$ is the albedo, $\alpha$ determines the convex combination
of the specular and matte components of the reflectance,
and $k$ is the degree of specularity. Provided one can identify
points of high curvature along the zero-crossing contours,
it is possible to determine the values of the parameters $k$, $\rho$
and $\alpha$ for the corresponding portion of the image (since the
values could change with changing surface material). Using
an interocular separation consistent with the separation of
human eyes, the technique is most effective at a distance
of about one meter. The technique may find application to
wide angle stereo, however, where the numerical stability
of the algorithm is expected to increase.

6.  Finding  Discontinuities 

The geometric properties of surfaces are almost certain
to be discontinuous at certain locations in the scene.
Depth discontinuities occur along occluding contours, while
orientation discontinuities occur along surface creases.
Discontinuities in surface geometry are usually, but not
always, reflected in image intensities. Terzopoulos [1983a]
decomposes the discontinuity problem into three subproblems:
(i) the detection of discontinuities in surface
geometry, (ii) the explicit representation of these discontinuities,
and (iii) a characterization of their influence on
visible surface reconstruction.

He argues that the first subproblem has a widespread
basis in early visual processing. The detection of discontinuities
is certain to require the conjunction of simultaneous
events in several visual modalities; for example, the coincidence
of texture boundaries or motion boundaries with
sudden disparity changes. If early visual processes are made
sensitive to such events, many prominent discontinuities
in surface geometry may be hypothesized before surface
reconstruction begins. On the other hand, discontinuities
which are subtle or hidden in the primal sketch, such as
those which typically occur in random dot stereograms, must
await detection until the surface reconstruction stage, when
a full depth map becomes available. Terzopoulos has experimented
with a simple method for detecting and localizing
depth discontinuities during the surface reconstruction
process. Localization involves finding inflections in the
bending  moments  of  the  plate  interpolating  surface,  while 
detection  relies  on  the  occurrence  of  significant  disparity 
gradients.  The  method  may  be  conceptualized  as  a  type 
of  edge  detection  over  a  tentative,  dense  depth  map,  and 
it  amounts  to  thresholding  according  to  the  magnitude  of 
the  surface  gradient  at  zero  crossings  of  the  Laplacian  of 
the  surface.  Constraints  on  binocular  imaging  geometry  can 
dictate  appropriate  bounds  on  the  threshold. 
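
The thresholding step can be sketched on a dense depth map as follows. This is a minimal illustration that assumes a regular grid, a 5-point discrete Laplacian, and a caller-supplied threshold; the function name and the convention of marking the pixel to the left of (or above) each sign change are choices made for this example.

```python
import numpy as np

def depth_discontinuities(depth, threshold):
    """Mark zero-crossings of the Laplacian where the gradient magnitude is large."""
    gy, gx = np.gradient(depth)
    grad_mag = np.hypot(gx, gy)

    # 5-point Laplacian on interior pixels; the border is left at zero.
    lap = np.zeros_like(depth, dtype=float)
    lap[1:-1, 1:-1] = (depth[:-2, 1:-1] + depth[2:, 1:-1] +
                       depth[1:-1, :-2] + depth[1:-1, 2:] -
                       4.0 * depth[1:-1, 1:-1])

    # A zero-crossing is a sign change between horizontal or vertical neighbors.
    zc = np.zeros(depth.shape, dtype=bool)
    zc[:, :-1] |= lap[:, :-1] * lap[:, 1:] < 0
    zc[:-1, :] |= lap[:-1, :] * lap[1:, :] < 0

    return zc & (grad_mag > threshold)
```

A gently sloping surface may produce spurious sign changes in the Laplacian through rounding, but these are rejected because the gradient magnitude stays below the threshold.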

The  thin  plate  surface  reconstruction  model  also 
suggested  how  to  apply  the  finite  element  method  to 
appropriately inhibit surface interpolation across discontinuities,
once they have been made explicit. In particular,
the surface reconstruction algorithm was generalized
to handle depth discontinuities (i.e., occluding contours)
and surface orientation discontinuities (i.e., creases).
Generalization involves “breaking” the interpolating plate
along depth discontinuities and “joining” plate patches
by strips of membrane along orientation discontinuities,
thus reconstructing piecewise smooth surfaces. [The mathematical
details are presented in Terzopoulos, 1983b.] Once
discontinuities  have  been  detected,  say,  by  the  method 
described  in  the  preceding  paragraph,  the  reconstructed 
surface  may  be  improved  by  a  few  additional  relaxation 
iterations. 

7.  Shape  description 
Smoothed  local  symmetries 

The  description  of  two-  and  three-dimensional  shape  is 
crucial  for  recognition.  Brady  [1982a,  1982b]  has  developed 
a  representation  of  two-dimensional  shapes  that  combines 
certain features of two-dimensional projections of generalized
cylinders [Nevatia and Binford 1977, Brooks 1981] and
the  symmetric  axis  transform  (SAT)  [Blum  and  Nagel  1978]. 
The  representation  has  been  applied  to  determine  where  to 
choose  grasp  points  on  a  lamina  for  a  two-fingered  robot 
hand. 

The  smoothed  local  symmetries  representation  has  four 
components.  First,  local  symmetry  is  defined  in  a  way  that 
differs  from  that  implicit  in  the  SAT.  Second,  axes  that  are 
smooth  loci  of  local  symmetries  are  computed.  In  this  way, 
smoothness  of  axes  is  made  explicit,  rather  than  being  left 
implicit  as  in  the  symmetric  axis  transform.  Third,  axes 
whose  region  of  support  is  wholly  subsumed  by  the  support 
of  some  other  axis  are  deleted.  The  resulting  smoothed 
local  symmetries  are  given  a  parametric  description  called 
a  frame.  Finally,  a  shape  is  decomposed  into  sub-objects  for 
which  smoothed  local  symmetry  descriptions  are  computed 
individually.  The  axes  act  as  local  coordinate  frames  and 
constrain  the  generation  of  descriptions  of  an  entire  shape 
by  combining  the  descriptions  of  subshapes. 

A pilot implementation of smoothed local symmetries
was reported in [Brady 1982c]. It repeatedly used an algorithm,
based on the mean value theorem, for determining
the points at which a line entering the shape at a given
orientation to the tangent emerges from the shape. In
this  way  the  local  symmetries  at  a  point  could  be  found 
iteratively.  The  pilot  implementation  worked  well,  but  was 
very  slow. 

Recently  Asada  and  Brady  [1983]  have  developed  an 
algorithm  that  computes  an  approximation  to  the  smoothed 
local symmetries of a shape. First, a set of feature points
is computed on the shape's bounding contour, as found by
the Canny edge detector. The feature points are points of
high curvature or points of inflexion, and they are found
by a process analogous to edge finding but applied to
the orientation of the curve (a one-dimensional function
of arclength). Features analogous to those computed for
the  original  primal  sketch  [Marr  1976]  are  extracted  and 
interpreted.  Second,  the  shape  is  approximated  by  best 
fitting  straight  lines  and  circles  to  the  feature  points  found 
in  the  first  stage.  Asada  and  Brady  have  worked  out 
the  smoothed  local  symmetries  generated  by  two  contours 
of  constant  curvature,  and  these  are  fit  to  the  segments 
produced  in  the  second  stage.  Finally,  the  smoothed  local 
symmetries  are  used  to  match  a  database  of  shape  models  for 
recognition  and  inspection.  Her  ■  and  Brady  have  developed 
a  sampling  algorithm  for  computing  the  smoothed  local 
symmetries  of  a  shape.  The  main  emphasis  of  their  work  is 
developing  algorithms  for  removing  locally  plausible  axes 
that  are  of  minor  significance  globally. 

Bagley and Brady [1983] generate elaborate shape
descriptions using a hierarchy of shape models incorporating
general geometric knowledge and, at a higher level,
application-specific information. These models, combined
with concavities in the boundary, allow isolation of subshapes.
They associate with each subshape a local reference
frame to characterize the joining of subshapes and to help
choose among multiple interpretations.

The  computation  of  shape  from  contour 

An  important  goal  of  early  vision  is  the  computation 
of  a  representation  of  the  orientation  of  visible  surfaces. 
Many processes contribute to achieving this goal, stereopsis
and structure-from-motion being the most studied in
image understanding. Three other important contributing
processes are shape-from-contour, shape-from-texture-gradients,
and shape-from-shading. Several psychophysical
demonstrations show that shape-from-contour is significantly
more powerful than shape-from-texture-gradients. Similarly,
Barrow and Tenenbaum [1981, Figure 1.3 ff] suggest that
shape-from-contour is a more effective clue to shape than
shape-from-shading.

Brady and Yuille have investigated the computation of
shape-from-contour. Many shapes are perceived as images
of surfaces which are oriented out of the picture plane.
Slant judgements are determined not by familiarity with
contours, but by more general knowledge of shapes and
surfaces. The method proposed by Brady and Yuille is
based  on  such  general  knowledge,  namely  a  preference  for 
symmetric,  or  at  least  compact,  surfaces.  Note  that  the 
contour  does  not  need  to  be  closed  in  order  to  be  interpreted 
as  oriented  out  of  the  image  plane.  In  general,  contours  are 
interpreted  as  curved  three-dimensional  surfaces. 

Brady  and  Yuille  develop  an  extremum  principle  for 
determining  three-dimensional  surface  orientation  from 
a  two-dimensional  contour.  Initially,  they  work  out  the 
extremum principle for contours that are closed and that
are assumed a priori to be the images of planar surfaces.
They discuss how to extend this approach to open contours
and  how  to  interpret  contours  as  curved  surfaces. 

The  extremum  principle  maximizes  a  familiar  measure 
of  the  compactness  or  symmetry  of  an  oriented  surface, 
namely  the  ratio  of  the  area  to  the  square  of  the  perimeter. 
It is shown that this measure is at the heart of the maximum
likelihood approach to shape-from-contour developed
by Witkin [1981] and Davis, Janos, and Dunn [1982]. The
maximum likelihood approach has had some success interpreting
irregularly shaped objects. However, the method is
ineffective when the distribution of image tangents is not
random, as is the case, for example, when the image is a
regular shape, such as an ellipse or a parallelogram. The extremum
principle interprets regular figures correctly. Brady
and Yuille show that the maximum likelihood method approximates
the extremum principle for irregular figures, but
that the maximum likelihood method does not compute the
correct slant for an ellipse. Witkin [1981, Figure 5] provides
empirical evidence that the maximum likelihood method
computes a good approximation to the perceived tilt but
underestimates the slant. Brady and Yuille prove that the
maximum likelihood method consistently overestimates
the slant of an ellipse. A more thorough investigation of
the difference between the extremum principle and the
maximum likelihood method is needed.
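
For a closed contour assumed to be the image of a planar surface, the extremum principle can be sketched as a brute-force search. The example below assumes orthographic projection, represents the contour as a polygon, and scans a grid of candidate slants and tilts; the function names, the grid step, and the search strategy are illustrative assumptions rather than the method of Brady and Yuille.

```python
import numpy as np

def compactness(pts):
    """Area divided by squared perimeter of a closed polygon, pts of shape (N, 2)."""
    x, y = pts[:, 0], pts[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    perimeter = np.sum(np.linalg.norm(pts - np.roll(pts, -1, axis=0), axis=1))
    return area / perimeter ** 2

def backproject(pts, slant, tilt):
    """Undo orthographic foreshortening for a plane with the given slant and tilt (radians)."""
    c, s = np.cos(tilt), np.sin(tilt)
    rot = np.array([[c, s], [-s, c]])   # rotates the tilt direction onto the x-axis
    q = pts @ rot.T
    q[:, 0] /= np.cos(slant)            # stretch along the tilt direction
    return q

def estimate_orientation(pts, step_deg=2):
    """Return (slant, tilt) in degrees maximizing the compactness of the backprojection."""
    best_measure, best_slant, best_tilt = -1.0, 0, 0
    for slant in range(0, 90, step_deg):
        for tilt in range(0, 180, step_deg):
            m = compactness(backproject(pts, np.radians(slant), np.radians(tilt)))
            if m > best_measure:
                best_measure, best_slant, best_tilt = m, slant, tilt
    return best_slant, best_tilt
```

For an ellipse with axis ratio 1:2 the search selects the orientation that backprojects to a circle, the most compact planar figure, giving a slant of 60 degrees.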

Kanade [1981, page 424] has suggested a method
for determining the three-dimensional orientation of skew-symmetric
figures, under the “heuristic assumption” that
such figures are interpreted as oriented real symmetries.
Brady and Yuille prove that the extremum principle
necessarily interprets skew symmetries as oriented real
symmetries, thus dispensing with the need for any heuristic
assumption to that effect. Kanade shows that there is a
one-parameter family of possible orientations of a skew-symmetric
figure, forming a hyperbola in gradient space.
He suggests that the minimum slant member of the one-parameter
family is perceived. In the special case of a real
symmetry, Kanade’s suggestion implies that symmetric
shapes are perceived as lying in the image plane, that
is, having zero slant. It is clear from the example of an
ellipse  that  this  is  not  correct.  Our  method  interprets  real 
symmetries  correctly. 

8.  Object  acquisition  and  shape  from  shading 

Photometric stereo as developed by Horn, Woodham
and Silver [Horn et al., 1978; Woodham, 1981] provides
shape  and  surface  orientation  from  multiple  images  of  the 
same  scene,  taken  under  different  conditions  of  incident 
illumination. 

Suppose  two  images  are  obtained  by  varying  the 
direction  of  the  incident  illumination.  Each  picture  element 
in  the  two  images  corresponds  to  the  same  physical 
point,  since  the  imaging  geometry  remains  unchanged. 
The  reflectance  map  is  changed,  however,  and  the  two 
values  for  each  point  can  determine  the  surface  orientation. 
(Three views provide complete disambiguation in all cases.)
Photometric stereo can be implemented very efficiently in
terms of a lookup table set up in an initial calibration phase
in which an object of known shape is imaged under the
different lighting conditions. Recently, Ikeuchi and Horn
have applied this technique to the difficult problem of bin
picking.
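
The way three images determine surface orientation can be sketched for the simplest case. The example below swaps the calibrated lookup table for a closed-form solution that holds only under an assumed Lambertian (matte) surface with known, non-coplanar light directions; the function name and argument layout are hypothetical.

```python
import numpy as np

def photometric_stereo(intensities, lights):
    """Recover the unit normal and albedo at one pixel from three brightness values.

    intensities: shape (3,), brightness under each light.
    lights: shape (3, 3), each row a unit vector toward one light source.
    Assumes Lambertian reflectance, I = albedo * (lights @ normal),
    with the pixel illuminated by all three sources.
    """
    g = np.linalg.solve(lights, intensities)  # g = albedo * normal
    albedo = np.linalg.norm(g)
    return g / albedo, albedo
```

A lookup table built from a calibration object serves the same purpose without assuming a reflectance model, which is why it extends to surfaces that are not matte.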

One of the remaining obstacles to the widespread application
of industrial robots is their inability to deal with
parts that are not precisely positioned. Present methods for
automating assembly operations require separate feeding
of the parts, with position and attitude carefully controlled.
Ikeuchi and Horn have demonstrated a system for
ABSTRACT


Our  principal  objective  in  this  research 
program  is  to  obtain  solutions  to  fundamental 
problems in computer vision, particularly those
problems  that  are  relevant  to  the  development  of 
an  automated  capability  for  interpreting  aerial 
imagery  and  the  production  of  cartographic 
products.

Our  plan  is  to  advance  the  state  of  the  art 
in  selected  core  areas  such  as  stereo  compilation, 
feature  extraction,  linear  delineation,  and  image 
matching; also, to develop an "expert system"
control  structure  which  will  allow  a  human 
operator  to  communicate  with  the  computer  at  a 
problem  oriented  level,  and  guide  the  behavior  of 
the  low  level  interpretation  algorithms  doing 
detailed image analysis.

Finally, we plan to use the DARPA/DMA Testbed
as  a  mechanism  for  transporting  both  our  own  and 
IU  community  advances,  in  image  interpretation  and 
scene  analysis,  to  DMA,  ETL,  and  other  members  of 
the  user  community. 


I  INTRODUCTION 

A  major  focus  of  our  current  work  is  the 
construction  of  an  Expert  System  for  Stereo 
Compilation  and  Feature  Extraction.  Our  intent  in 
this  effort  is  to  develop  a  system  that  provides  a 
framework  for  allowing  higher  level  knowledge  to 
guide  the  detailed  interpretation  of  imaged  data 
by  autonomous  scene  analysis  techniques.  Such  a 
system  would  allow  symbolic  knowledge,  provided  by 
higher  level  knowledge  sources,  to  automatically 
control  the  selection  of  appropriate  algorithms, 
adjust  their  parameters,  and  apply  them  in  the 
relevant  portions  of  the  image. 

Recognizing  the  difficulty  of  completely 
automating  the  interpretation  process,  the  expert 
system  will  be  structured  so  that  a  human  operator 
can  provide  the  required  high  level  information 
when  there  are  no  reliable  techniques  for 
automatically  extracting  this  information  from  the 
available  imagery.  As  new  research  results  become 
available,  the  level  of  human  interaction  can  be 
progressively  reduced. 

The  expert  system  we  are  building  can  thus  be 
viewed as an intelligent user-level interface for
guiding  semiautomated  image  processing  activities. 
Such  a  system  is  envisioned  as  a  rule-based  system 

Development of the expert system control
structure is a research task still in an early
stage of accomplishment. The remainder of this
report will describe progress in research
supporting the development of potential scene
analysis components of the system, as well as other
Image Understanding research of a more basic
nature. We also briefly describe the status of the
DARPA/DMA Testbed effort now approaching
completion.


II  RESEARCH  PLANS  AND  PROGRESS 

A. Development of Methods for Modeling and Using
   Physical Constraints in Image Interpretation.

Our  goal  in  this  work  is  to  develop  methods 
that  will  first  allow  us  to  produce  a  sketch  of  the 
physical  nature  of  a  scene  and  the  illumination  and 
imaging conditions, and next permit us to use this
physical  sketch  to  guide  and  constrain  the  more 
detailed  descriptive  processes  —  such  as  precise 
stereo  mapping. 

Our  approach  is  to  develop  models  of  the 
relationship between physical objects in the scene
and the intensity patterns they produce in an image
(e.g., models that allow us to classify intensity
edges in an image as either shadow, or occlusion,
or surface intersection, or material boundaries in
the scene); models of the geometric constraints
induced by the projective imaging process (e.g.,
models that allow us to determine the location and
orientation of the camera that acquired the image,
location of the vanishing points induced by the
interaction between scene and camera, location of a
ground plane, etc.); and models of the illumination
and intensity transformations caused by the
atmosphere,  light  reflecting  from  scene  surfaces, 
and  the  film  and  digitization  processes  that  result 
in  the  computer  representation  of  the  image. 

These  models,  when  instantiated  for  a  given 
scene,  provide  us  with  the  desired  "physical" 
sketch.  We  are  assembling  a  "constraint-based 




Poggio, T., Nishihara, H. K., Nielsen, K. R. K. “Zero-crossings
and spatiotemporal interpolation in vision: aliasing
and electrical coupling between sensors”. MIT A. I. Memo
675, 1982.

Stevens, K. A. “The visual interpretation of surface contours”.
Art. Intell., 17, 47-75, 1981.

Terzopoulos, D. “Multi-level reconstruction of visual surfaces”.
MIT A. I. Memo 671, 1982.

Terzopoulos, D. “The role of constraints and discontinuities
in visible-surface reconstruction”. Proc. Int. Jt. Conf.
Art. Intell., Karlsruhe, 1983.

Terzopoulos, D. “Multi-level reconstruction of visual surfaces”.
MIT A. I. Memo 671, 1983b. To appear in
Multiresolution Image Processing and Analysis, A. Rosenfeld,
ed.

Witkin, A. P. “Shape from contour”. Ph.D. thesis, MIT.
Also MIT A. I. Tech. Report 589, 1980.

Woodham, R. J. “Analyzing images of curved surfaces”.
Art. Intell., 17, 117-141, 1981.

Ullman, S., Hildreth, E. C. “The measurement of visual
motion”. In: Physical and Biological Processing of Images,
Braddick and Sleigh, eds., Springer-Verlag, Berlin, 1983.

Yuille, A. “Zero crossings on lines of curvature”. Submitted
to AAAI Conf., Washington, D. C., Sept., 1983.


stereo system" that can use this physical sketch to
resolve  the  ambiguities  that  defeat  conventional 
approaches  to  stereo  modeling  of  scenes  (e.g., 
urban  scenes  or  scenes  of  cultural  sites)  for  which 
the  images  are  widely  separated  in  either  space  or 
time,  or  for  which  there  are  large  featureless 
areas,  or  a  significant  number  of  occlusions. 

Recent  publications  of  our  work  in  this  area 
are  cited  in  the  references  [1-4,  9-12]. 

B. Stereo Compilation: Image Matching and
   Interpolation

We  are  implementing  a  complete  state-of-the- 
art  stereo  system  that  produces  dense  range  images 
from  given  pairs  of  intensity  images.  We  plan  to 
use  this  system  both  as  a  framework  for  our  stereo 
research,  and  as  the  base  component  of  our  planned 
expert  system. 

There  are  five  components  of  this  stereo 
system:  a  rectifier,  a  sparse  matcher,  a  dense 
matcher,  an  interpolator,  and  a  projective  display 
module.  The  rectifier  estimates  the  parameters  and 
distortions  associated  with  the  imaging  process, 
the  photographic  process,  and  the  digitization. 
These  parameters  are  used  to  map  digitized  image 
coordinates  onto  an  ideal  image  plane.  The  sparse 
matcher  performs  two-dimensional  searches  to  find 
several  matching  points  in  the  two  images,  which  it 
uses  to  compute  a  relative  camera  model.  The  dense 
matcher  tries  to  match  as  many  points  as  possible 
in  the  two  images.  It  uses  the  relative  camera 
model  to  constrain  the  searches  to  one  dimension, 
along  epipolar  lines.  The  interpolator  computes  a 
grid  of  range  values  by  interpolating  between  the 
matches  found  by  the  dense  matcher.  The  projective 
display  module  allows  interactive  examination  of 
the  computed  3-D  model  by  generating  2-D  projective 
views of the model from arbitrarily selected
locations in space. Initial versions of all
components  of  the  system  have  been  implemented. 
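
The one-dimensional search performed by a dense matcher can be sketched as follows. This is a minimal illustration, not the system described here: it assumes the images are already rectified so that epipolar lines coincide with image rows, and it uses a sum of squared differences over a square window as the match score. The function name and the `half` and `max_disp` parameters are hypothetical.

```python
import numpy as np

def match_along_row(left, right, y, x, half=3, max_disp=10):
    """Find the disparity of left pixel (y, x) by searching along row y of `right`.

    The candidate for disparity d is centered at column x - d. The caller is
    expected to keep the window around (y, x) inside the image.
    """
    patch = left[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    best_d, best_cost = None, np.inf
    for d in range(max_disp + 1):
        xr = x - d
        if xr - half < 0:
            break  # candidate window would leave the image
        cand = right[y - half:y + half + 1, xr - half:xr + half + 1].astype(float)
        cost = np.sum((patch - cand) ** 2)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

Restricting the search to a single row is what the relative camera model buys: without it, every candidate in a two-dimensional neighborhood would have to be scored.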

Present research in this task is focused
primarily  on  the  image  correspondence  (matching) 
and  interpolation  problems.  With  respect  to  image 
matching,  the  following  major  issues  are  being 
addressed: 

*  What  is  a  correct  match? 

* How does one measure the performance of a
matcher?

* What causes existing matching techniques to
fail?

* How can one improve the performance of
matching techniques?

Since there are no reliable analysis
techniques for evaluating the performance of
matching algorithms when applied to real world
images, we must evaluate them by extensive testing.
To expedite such testing, a database of images and
ideal match data (ground truth) is being assembled.
For example, we have acquired data from the ETL
Phoenix test site that were produced specifically
for testing matching techniques. Every point in
the database we are constructing contains
annotations that indicate the categories of
matching problems for that point, and other
information that might be useful to evaluate the
performance or guide the application of matching
techniques.

We are currently investigating a hypothesize-
verify approach to local matching. Potential
matches  are  verified  by  examining  the  image  for 
compliance  with  the  assumptions  of  the  matching 
operator's  model.  For  example,  area  correlation 
matching  operators  assume  that  correctly  registered 
image  patches  will  differ  only  by  Gaussian  noise. 
A  simple  verification  technique  is  to  examine  the 
statistics  of  the  point-by-point  difference  between 
the  hypothesized  alignment  of  the  patches  for 
conformance  with  that  model.  Image  anomalies,  such 
as  moving  objects  or  occluding  contours,  will 
typically  produce  a  difference  image  that  has  a 
highly  structured  geometry,  indicating  the  shape 
and  location  of  the  anomaly.  Such  anomalous  areas 
can  be  removed  from  the  region  over  which  the 
correlation  is  computed,  and  the  process  iterates 
until  either  an  acceptable  match  criterion  is 
satisfied,  or  too  many  points  are  removed  from  the 
region. 
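
The verification loop can be sketched roughly as follows. This is an illustration under stated assumptions rather than the method as implemented: patches are flat lists of intensities, the noise level noise_sigma is supplied by the caller, the median is used as the centre of the difference distribution, and the thresholds k and min_fraction are hypothetical. The sketch also omits recomputing the correlation over the reduced region between iterations.

```python
import statistics


def verify_match(patch_a, patch_b, noise_sigma, k=3.0, min_fraction=0.5,
                 max_iters=10):
    """Iteratively discard pixels whose difference is inconsistent with noise.

    Returns (accepted, kept_indices). The match is accepted once every
    remaining difference lies within k * noise_sigma of the median
    difference, and rejected if too few pixels remain.
    """
    active = list(range(len(patch_a)))
    for _ in range(max_iters):
        diffs = [patch_a[i] - patch_b[i] for i in active]
        center = statistics.median(diffs)
        kept = [i for i, d in zip(active, diffs)
                if abs(d - center) <= k * noise_sigma]
        if not kept or len(kept) < min_fraction * len(patch_a):
            return False, kept      # too many points removed from the region
        if len(kept) == len(active):
            return True, kept       # remaining differences look like noise
        active = kept               # drop the anomalous pixels and iterate
    return False, active            # did not settle within max_iters


# Two pixels (indices 4 and 5) are covered by an anomaly such as an occluder.
patch_a = [10, 11, 9, 10, 50, 52, 10, 11, 9, 10]
patch_b = [10] * 10
accepted, kept = verify_match(patch_a, patch_b, noise_sigma=1.0)

# A hypothesized alignment whose differences are nowhere noise-like.
bad_a = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
rejected, _ = verify_match(bad_a, [0] * 10, noise_sigma=1.0)
```

The indices dropped from kept form the structured part of the difference image; in a two-dimensional patch their layout would indicate the shape and location of the anomaly.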

In  many  cases  (e.g.,  occlusion  and  featureless 
areas)  local  matching  techniques  are  not  capable  of 
producing  the  required  correspondences  over  regions 
of  significant  extent.  We  intend  to  use  the 
information  provided  by  the  "physical  sketch"  (see 
previous  section)  to  detect  such  situations,  and  to 
select  alternative  means  for  obtaining  the  required 
depth  information. 

As  indicated  above,  when  a  stereo  pair  of 
images  are  matched,  we  generally  can  do  no  better 
than  to  compute  a  sparse  depth  map  of  the  imaged 
scene.  However,  for  many  tasks  a  sparse  depth  map
is  inadequate.  We  want  a  complete  model  that 
accurately  portrays  the  scene's  surfaces.  To 
achieve  this  goal,  we  must  be  able  to  obtain  the 
missing  surface  shape  information  from  the  shading 
of  the  images  of  the  stereo  pair. 

To  understand  the  relationship  between  image 
shading  and  surface  shape,  we  built  a  differential 
model  [10,11]  that  relates  shape  and  shading  but, 
unfortunately,  does  not  provide  a  complete  basis 
for  a  shape  recovery  algorithm  [12].  However,  the
information  available  in  image  shading  does  allow 
the  building  of  a  surface  interpolation  algorithm 
that  finds  a  surface  that  is  consistent  with  the 
image  shading.  We  are  proceeding  with  such  a 
development. 

As  image  shading  alone  does  not  provide 
sufficient  information  to  find  surface  orientation, 
further  shape  information  sources  in  the  image  are 
needed.  We  are  evaluating  additional  scene 
attributes  that  encode  shape  information  in  their 
image,  and  the  models  necessary  to  recover  the 
corresponding  shape  information. 

C.  Feature  Extraction:  Scene  Description,  Partitioning,  and  Labeling

Our  current  research  in  this  area  addresses 
two  related  problems:  (1)  representing  natural 
shapes  such  as  mountains,  vegetation,  and  clouds, 
and  (2)  computing  such  descriptions  from  image 
data. The first step towards solving these
problems is to obtain a model of natural surface
shapes.

A model of natural surfaces is extremely
important  because  we  face  problems  that  seem 
impossible  to  address  with  standard  descriptive 
computer  vision  techniques.  How,  for  instance, 
should  we  describe  the  shape  of  leaves  on  a  tree? 
Or  grass?  Or  clouds?  When  we  attempt  to  describe 
such common, natural shapes using standard
shape-primitive representations, the result is an
unrealistically complicated model of something
that, viewed introspectively, seems very simple.
Furthermore,  how  can  we  extract  3-D  information 
from  the  image  of  a  textured  surface  when  we  have 
no  models  that  describe  natural  surfaces  and  how 
they  evidence  themselves  in  the  image?  The  lack  of 
such  a  3-D  model  has  restricted  image  texture 
descriptions  to  being  ad  hoc  statistical  measures 
of  the  image  intensity  surface. 

Fractal  functions,  a  novel  class  of  naturally- 
arising  functions,  are  a  good  choice  for  modeling 
natural  surfaces  because  many  basic  physical 
processes  (e.g.,  erosion  and  aggregation)  produce  a 
fractal  surface  shape,  and  because  fractals  are 
widely  used  as  a  graphics  tool  for  generating 
natural-looking  shapes.  Additionally,  we  have 
recently  conducted  a  survey  of  natural  imagery  and 
found  that  a  fractal  model  of  imaged  3-D  surfaces 
furnishes  an  accurate  description  of  both  textured 
and  shaded  image  regions,  thus  providing  validation 
of  this  physics-derived  model  for  both  image 
texture  and  shading. 

Encouraging  progress  relevant  to  computing  3-D 
information  from  imaged  data  has  already  been 
achieved  by  use  of  the  fractal  model.  We  have 
derived  a  test  to  determine  whether  or  not  the 
fractal  model  is  valid  for  particular  image  data, 
developed  an  empirical  method  for  computing  surface 
roughness  from  image  data,  and  made  substantial 
progress  in  the  areas  of  shape-from-texture  and
texture  segmentation.  Characterization  of  image 
texture  by  means  of  a  fractal  surface  model  has 
also  shed  considerable  light  on  the  physical  basis 
for  several  of  the  texture  partitioning  techniques 
currently  in  use,  and  made  it  possible  to  describe 
image  texture  in  a  manner  that  is  stable  over 
transformations  of  scale  and  linear  transforms  of 
intensity. 
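
One way to see why a fractal description is stable under linear transforms of intensity is to estimate a scaling exponent H from how the mean absolute intensity difference grows with pixel separation. The sketch below is an illustration only, not the test or roughness measure derived in this work: the function names, the choice of lags, and the 1-D random-walk "image profile" are all assumptions. For a fractional Brownian profile the fractal dimension is 2 - H (3 - H for a surface).

```python
import math
import random


def mean_abs_increment(profile, lag):
    """Average |I(x + lag) - I(x)| over a 1-D intensity profile."""
    n = len(profile) - lag
    return sum(abs(profile[i + lag] - profile[i]) for i in range(n)) / n


def estimate_hurst(profile, lags=(1, 2, 4, 8)):
    """Least-squares slope of log(mean increment) against log(lag).

    Assumes the profile is not constant (the increments must be nonzero).
    """
    xs = [math.log(lag) for lag in lags]
    ys = [math.log(mean_abs_increment(profile, lag)) for lag in lags]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Synthetic profile: a Gaussian random walk, whose exponent is near 0.5.
rng = random.Random(0)
walk = [0.0]
for _ in range(2000):
    walk.append(walk[-1] + rng.gauss(0.0, 1.0))

h_original = estimate_hurst(walk)
# A linear intensity transform a*I + b shifts every log-increment by log(a),
# so the slope (and hence the estimated dimension) is unchanged.
h_transformed = estimate_hurst([3.0 * v + 7.0 for v in walk])
h_ramp = estimate_hurst([float(i) for i in range(100)])  # smooth ramp: H = 1
```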

The  computation  of  a  3-D  fractal-based 
representation  from  actual  image  data  has  been 
demonstrated.  This  work  has  shown  the  potential  of 
a  fractal-based  representation  for  efficiently 
computing  good  3-D  representations  for  a  variety  of 
natural  shapes,  including  such  seemingly  difficult 
cases  as  mountains,  vegetation,  and  clouds. 

This  research  is  expected  to  contribute  to  the 
development  of  (1)  a  computational  theory  of  vision 
applicable  to  natural  surface  shapes,  (2)  compact 
representations  of  shape  useful  for  natural 
surfaces,  and  (3)  real-time  regeneration  and 
display  of  natural  scenes.  We  also  anticipate 
adding  significantly  to  our  understanding  of  the 
way  humans  perceive  natural  scenes. 

Details  of  this  work  can  be  found  in  Pentland  [8].


D.  Linear  Delineation  and  Partitioning 

A  basic  problem  in  machine  vision  research  is 
how  to  produce  a  line  sketch  that  adequately 
captures  the  semantic  information  present  in  an 
image.  (For  example,  maps  are  stylized  line 
sketches  that  depict  restricted  types  of  scene 
information.)  Before  we  can  hope  to  attack  the 
problem  of  semantic  interpretation,  we  must  solve 
some  open  problems  concerned  with  direct  perception 
of  line-like  structure  in  an  image  and  with 
decomposing complex networks of line-like
structures into their primitive (coherent)
components. Both of these problems have important
practical as well as theoretical implications.

For example, the roads, rivers, and rail-lines
in aerial images have a line-like appearance.
Methods for detecting such structures must be
general enough to deal with the wide variety of
shapes they can assume in an image as they traverse
natural terrain.

Most  approaches  to  object  recognition  depend 
on  using  the  information  encoded  in  the  geometric 
shape  of  the  contours  of  the  objects.  When  objects 
occlude  or  touch  one  another,  decomposition  of  the 
merged contours is a critical step in
interpretation.

We  have  recently  made  significant  progress  in 
both  the  delineation  and  the  partitioning  problems. 
Our  work  in  delineation  [3]  is  based  on  the 
discovery  of  a  new  perceptual  primitive  that  is 
highly  effective  in  locating  line-like  (as  opposed 
to  edge-like)  structure. 

Our  work  on  decomposing  linear  structures  into 
coherent  components  [6]  is  based  on  the  formulation 
of  two  general  principles  that  appear  to  have 
applicability  over  a  wide  range  of  problems  in 
machine  perception.  The  first  of  these  principles 
asserts  that  perceptual  decisions  must  be  stable 
under  at  least  small  perturbations  of  both  the 
imaging  conditions  and  the  decision  algorithm 
parameters.  The  second  principle  is  the  assertion 
that  perception  is  an  explanatory  process: 
acceptable  percepts  must  be  associated  with
explanations  that  are  both  complete  (i.e.,  they 
explain  all  the  data)  and  believable  (i.e.,  they 
are  both  concise  and  of  limited  complexity). 

These  new  delineation  and  partitioning 
algorithms  have  produced  excellent  results  in
experimental  tests  on  real  data  [5,6]. 


III  STATUS  OF  THE  DARPA/DMA
IMAGE  UNDERSTANDING  TESTBED 

The  DARPA/DMA  Image  Understanding  Testbed 
established  at  SRI  as  part  of  the  DARPA  Image 
Understanding  research  program  constitutes  a 
coherent  body  of  software  running  in  a  standard 
hardware  environment.  Demonstrations  of  the 
features  and  capabilities  of  all  IU  community 
contributed  software  are  available;  detailed 
evaluations  have  been  carried  out  for  selected 
modules  (e.g.,  see  the  paper  by  K.  Laws  [7]  in 
these  proceedings).  In  this  capacity,  the  Testbed

Abstract

Research on intelligent systems for image understanding focuses
on a successor to the ACRONYM system, and its application in a
rule-based stereo mapping and interpretation system. Some
elements of a rule-based stereo system have been implemented. A
new modeling system is under construction, and a new graphics
system for display of generalized cylinders has achieved initial
results. Research has continued on segmentation/aggregation
in the figure-ground problem for grouping candidate objects.
Implementation experiments are underway for an array of vision
processors. Fundamental mathematical results have been
obtained on matching processes. Inference rules for interpreting
surfaces from images were demonstrated formally in a
mathematical logic programming system. Results have been obtained
in specializing certain vision programs by automatic methods to
produce efficient programs.

The objectives of this research are to develop algorithms for
high performance image understanding modules, to implement
an intelligent vision system, and to demonstrate its application
in photointerpretation and cartography. The ACRONYM system
was developed as the first intelligent system. Research has
shifted to its SUCCESSOR.

A rule-based stereo mapping system is under construction.
Various members of the group have built elements for a
demonstration described in [Baker 83]. This work was supported
in part by RADC. These include: an evaluation of Pentland's
shape from shading program; an extended version of Baker's
program which includes edges from [Marimont 82]; a stereo
registration and rectification program by Meller; generic building models
and typical building examples by Gray; stereo matching of
orthogonal trihedral vertices by Malik and Binford; monocular and
stereo inference rules by Malik and Binford; example rules for a
rule-based stereo system.

Miller  and  Lowry  have  continued  progress  toward  building  a 
small  array  of  image  processors  [Lowry  82].  Other  work  has 
begun  in  the  architecture  of  algorithms  for  image  understanding. 

Cowan has begun implementation of a new modeling system for
SUCCESSOR. Rublee and Selker have investigated the user
interface for an intelligent geometric editor. Chelberg has
investigated the constraint system and rule base of ACRONYM, using
a large set of aircraft models. Minor problems were identified
and fixed. He is investigating more powerful mechanisms for
the constraint system. Lowry has done initial work in problem
formulation for a class of computational geometry problems.
[Scott 83] has implemented a general system for calculating the
terminators (visible boundaries of curved surfaces) for a broad
class of generalized cylinder models. The algorithm, capable
of parallel implementation, calculates the perspective image of
a generalized cylinder, from arbitrary viewpoint, with hidden
surface removal. It applies to a wide class of cylinders. The time
taken will be proportional to the total length of the contours,
independent of the number of edges. The algorithm solves for
one closed-loop contour-generator at a time, testing its contour
(in the image plane) for intersection with visible segments of
previous contours.

[Lowe 83] have extended the analysis of the figure/ground
problem, which we formulate as the discovery of non-random
structure in images, whether interpreted as surfaces in three space,
or as patterns and texture in the plane. Uniform, non-random
structure has an interpretation of common physical origin.
Marimont has worked at finding edges in intensity surfaces. As a
subproblem of segmenting intensity surfaces, he has investigated
segmenting curves. Results have been obtained for the problem
of determining a smooth curve through two samples, each with
point, tangent vector, and curvature.

[Blicher 83] has developed some fundamental mathematical
theory underlying vision. He defines a mathematical structure
which can be used as a framework for studying many vision
problems. Drawing on differential topology, he uses the framework
to prove a theorem regarding the stereo matching problem. The
main result is that without constraints on imaging geometry,
matching of typical pictures requires at least 2 color dimensions
for uniqueness. He also presents some theory about the topology
of iso-brightness contour lines, which is useful in understanding
the behavior of systems which track some value, e.g.
zero-crossings. The paper provides vision researchers with a view of
some of the powerful results of modern differential topology; the
methods used are applicable to stereo, motion stereo, optic flow,
and matching.

[Ketonen 83] is investigating ways of formally expressing facts
about images. In particular, he can show that some of the
coincidence assumptions stated in [Binford 81] can actually be
proved in a suitable formal framework.

[Goad 83] describes the automatic generation of special purpose
vision programs. The starting point for the automatic
construction process is a description of a particular 3D object. The result
is a special purpose program for recognizing and locating that
object in images, without restriction on the orientation of the
object in space. Thus each object description is analyzed in
advance, and then "compiled" into an efficient program for
detecting that object in images. The method has been implemented
and tested on a variety of images with good results. Some of
the tests involved images in which the target objects appear in a
jumbled pile. The current implementation is not fully optimized
for speed. However, evidence is given that image analysis times
on the order of a second or less can be obtained for typical
industrial recognition tasks. (This time estimate excludes edge
finding.)

Perceptual  Organization 

We  have  a  practical  objective  which  is  to  implement  more  general 
interpretation  in  ACRONYM.  A  short  term  goal  is  an  improved 
ribbon  finder,  coupled  with  a  mechanism  for  making  canonical 
clusters of ribbons. ACRONYM matches predicted ribbons or

is  now  established  as  a  technology  transfer  tool
that  can  be  utilized  by  appropriate  agencies  to 
evaluate  the  applicability  of  the  contributed  scene 
analysis  techniques. 

Documentation of the Testbed is entering its
final phase. Final drafts of the User's manual,
the Programmer's manual, and the System Manager's
manual are available and will soon pass through the
required editing and approval procedures. Drafts
of the evaluation reports for the Ghough and
Phoenix programs are also complete. We are
currently completing both the evaluation report for
the Relaxation package and the user-level
documentation of those contributions for which no
detailed  evaluation  is  planned.  More  extensive 
studies  of  the  various  approaches  to  stereo 
compilation  now  available  on  the  Testbed  will  be 
integrated  into  the  ongoing  research  effort  on  the 
stereo  problem. 

The  Testbed  is  now  sufficiently  well-defined 
that  exact  copies  of  the  entire  system  can  be 
configured,  if  desired.  SRI,  under  a  separate 
contract,  is  just  completing  the  installation  of  a 
Testbed  copy  (hardware  and  software)  at  the  US  Army 
Engineer  Topographic  Laboratories  (ETL)  at  Fort 
Belvoir.  A  Lisp  Machine  will  be  added  to  the  ETL
configuration  later  in  the  year.  SRI  will  also  be 
supplying  Lisp  Machines  and  Lisp  Machine  software 
to  the  DMAHTC  and  DMAAC  branches  of  the  Defense 
Mapping  Agency.  SRI  has  been  closely  involved  in 
efforts  to  ensure  that  the  upgrade  of  the  DMA 
AFES/RWPF  facilities  to  the  VAX-11/780  CPU  can 
incorporate  the  Image  Understanding  Testbed 
capabilities,  as  well  as  supporting  the  Lisp 
Machines.

The  Testbed  software  system  and  its  utilities 
are  being  prepared  for  export  to  university 
researchers  in  the  IU  program  as  well  as  to  other 
U.S.  Government  agencies  interested  in 
establishing  Testbed  copies.  SRI  has  developed  a 
simple  license  agreement  to  help  protect  Testbed 
contributors  and  restrict  use  of  the  software  to 
appropriate academic and government research
environments.

ellipses with observed ribbons or ellipses; then it tests whether
clusters of image elements satisfy object constraints in three
space. Typically, matching single ribbons is weak, while
matching all pairs, triples, and n-tuples of ribbons is combinatorially
unattractive. Limiting the combinatorics leads to introducing
proximity grouping, then to a thorough investigation of grouping
mechanisms from first principles.

We have studied fundamental properties of perceptual
organization. The bottom-up process of grouping related image features
plays an important role in 3-D inference, model-based
recognition, and matching processes such as stereo correspondence. We
measure the significance of image relations as inversely
proportional to the probability that they would have arisen by accident
from the surrounding distribution of features. This is a general
measure that requires no prior knowledge of the scene, and can
therefore be applied uniformly at the earliest stages of the image
interpretation process.
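
As a hedged illustration of this measure, consider the proximity of two point features. If the surrounding features are assumed to be scattered as a uniform Poisson process of known density (an assumption made only for this sketch, as are the function names), the probability that some feature falls within a given distance by accident has a closed form, and its reciprocal can serve as the significance.

```python
import math


def accidental_probability(distance, density):
    """Chance that at least one feature lies within `distance` of a point.

    Assumes features form a uniform Poisson process with `density`
    features per unit area; `distance` must be positive.
    """
    return 1.0 - math.exp(-math.pi * density * distance ** 2)


def significance(distance, density):
    """Significance taken as inversely proportional to accidental probability."""
    return 1.0 / accidental_probability(distance, density)


near = significance(distance=1.0, density=0.01)
far = significance(distance=5.0, density=0.01)
near_in_clutter = significance(distance=1.0, density=0.5)
```

Only the local feature density enters the computation, which is why such a measure needs no prior knowledge of the scene: the same proximity counts for less in a cluttered region than in a sparse one.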

Because the image relations are likely to have arisen from
properties of the scene rather than through an accident of image
formation, they provide a reliable basis for matching against
models. ACRONYM currently relies on ribbons and ellipses as
its perceptual description of an image, but this set could be
expanded to include all reliably detectable relations. Typically
relations which are non-accidental will be quasi-invariant with
respect to viewpoint. This means that these relations can be
used for stereo correspondence matching, at a higher and more
robust level than simple edge points.

We have also studied the complexity of the process of forming
image relations. It would be combinatorially expensive to
examine all possible relations between image features. Therefore,
we have used diameter-limited grouping processes applied at
multiple scales and overlapping locations. At any scale, the number
of alternatives for forming relations must be low, or none will be
attempted. In this way, computation is limited by complexity
rather than by prior limits on scale or density. Some of this work
is the basis for estimates concerning architecture of intermediate-
level vision.

We have currently implemented a curve description program
which looks for non-accidental linear or curvilinear structure in
edge data. This program is able to detect significant structure
occurring at multiple scales in the same edge. It requires no prior
knowledge of the noise properties in an edge, but uses the given
data to estimate the scales at which the curve exhibits the most
significant structure.

Stereo  Vision 

Baker has modified the stereo system to include edges from
[Marimont 82]. The system now uses improved edge operators
and includes edge extent in seeking optimal correspondence. The
system now deals with stereo pairs in which epipolar lines are not
coincident with the camera raster. To bring this about, Meller
made a program to determine epipolar lines from the camera
transform data of [Gennery 77]. To perform the interpolation of
surfaces based on intensity interpolation, Meller's program was
made to produce images rectified to epipolar geometry.

A system was developed for input of hand-segmented images as
a basis for developing higher level inference and correspondence
functions independent of the development of segmentation
algorithms.

Orthogonal trihedral vertices (OTVs) are an important
structural element in buildings. OTVs were analyzed in part by
[Liebes 81]. Malik and Binford provided an analysis for general
orientation in perspective. The analysis was implemented as
an inequality to test candidate OTVs; all OTVs and some
non-OTVs are accepted by the inequality. An algorithm determines
the angle in space from a single image. Clearly, corresponding
views of an OTV must imply the same orientation in space.

Malik and Binford have determined new monocular inference
rules which have applicability to stereo, including a generalized
support interpretation. They have produced a stereo inference
rule which imposes a sign reversal constraint on pairs of vectors.
If two images of a pair of vectors are to correspond, the
z-component of their cross product (determined entirely by image
quantities) must not change sign.
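
The sign reversal constraint reduces to a few arithmetic operations on image vectors. The sketch below is illustrative: the function names are invented here, and treating a zero cross product (collinear image vectors) as "no reversal" is an assumption of this sketch, not something stated in the rule.

```python
def cross_z(u, v):
    """z-component of the cross product of two 2-D image vectors."""
    return u[0] * v[1] - u[1] * v[0]


def passes_sign_constraint(left_u, left_v, right_u, right_v):
    """True if the cross product does not change sign between the two views.

    A zero cross product in either view is not counted as a reversal here;
    that choice is an assumption of this sketch.
    """
    return cross_z(left_u, left_v) * cross_z(right_u, right_v) >= 0


# Same handedness in both views: the pairing is admissible.
ok = passes_sign_constraint((1.0, 0.0), (0.0, 1.0), (1.0, 0.1), (0.2, 1.0))
# The two vectors are swapped in the right image: the pairing is ruled out.
swapped = passes_sign_constraint((1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (1.0, 0.0))
```

Because the test uses only image quantities, it can prune candidate correspondences before any three-dimensional reconstruction is attempted.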

Segmentation  and  Representation  of  Curves 

We have developed an algorithm to compute a segmentation and
representation of digital curves, applicable to edges extracted
from images, intended to facilitate higher-level analysis of curves.
A number of psychological and mathematical considerations have
led us to segment curves at extrema and zeroes of estimated
curvature. Psychological data suggest that humans segment
curves at extrema, and that humans are insensitive to
derivatives of order higher than two (curvature is closely related to
the second derivative). Further, zeroes and extrema of
curvature have mathematical properties of invariance under certain
geometric transformations which enable reliable estimation of
curvature characteristics independent of the curve's position and
orientation. A related conjecture currently being investigated is
whether suitably chosen invariants of space curves map stably
under perspective projection into extrema and zeroes of
curvature of the image plane curve.
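
A minimal sketch of the segmentation rule follows, assuming that curvature has already been estimated at evenly spaced samples along the curve. The function name and the sample values are hypothetical, and plateaus of equal curvature are ignored for brevity.

```python
def curvature_knots(kappa):
    """Return (index, kind) pairs where curvature crosses zero or is extremal.

    kappa is a list of curvature estimates along the curve. A "zero" is
    reported at the first sample on the far side of a sign change; an
    "extremum" is a strict local maximum or minimum of curvature.
    """
    knots = []
    for i in range(1, len(kappa) - 1):
        prev, cur, nxt = kappa[i - 1], kappa[i], kappa[i + 1]
        crosses_zero = (prev < 0 <= cur) or (prev > 0 >= cur)
        is_extremum = (cur > prev and cur > nxt) or (cur < prev and cur < nxt)
        if crosses_zero:
            knots.append((i, "zero"))
        elif is_extremum:
            knots.append((i, "extremum"))
    return knots


# A bend to the left, an inflection, then a bend to the right.
kappa = [0.1, 0.3, 0.6, 0.3, 0.1, -0.2, -0.5, -0.2]
knots = curvature_knots(kappa)
```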

We estimate curvature at all scales, and a pyramid of curvature
estimates is constructed suitable for detection and representation
of linear and hierarchical relationships among the estimates. We
use this pyramid to evaluate robustly the significance of
curvature changes at one scale in the context of others; we thereby
eliminate the need for extensive prior knowledge of sensor noise,
for instance. Estimates of significant curvature changes are
retained at all scales, so tasks needing only rough estimates are
not computationally overburdened by unnecessary detail, while
those able to use high accuracy effectively achieve optimal
performance.


Splines  for  Vision 

We have completed a preliminary implementation of a new type
of spline based on intrinsic geometric properties of curves. We
argued above that digital curves should be segmented at extrema
and zeroes of curvature. This new spline takes as input two
points, two tangents, and two curvatures, and returns a curve
which is: in agreement with the input data at the two points;
continuous; continuous in tangent; continuous in curvature, with
curvature varying monotonically along the curve. Curvatures at
the endpoints cannot be of different signs, although one can be
zero and one nonzero. If our curvature estimates are consistent
with the assumption that curvature is continuous, this restriction
poses no problem, since placing knots at all zeroes and extrema of
curvature implies that no two adjacent knots can have curvatures
of opposite sign (if they did, there would be a zero of curvature
between them, and therefore a knot).

Curvature must change monotonically between knots to avoid
introducing spurious curvature extrema, i.e. extrema not present
in the curve underlying our curvature estimates. If the
curvatures at the two points to be splined are k1 and k2, with
k1 less than k2, then the statement that curvature increases
monotonically between k1 and k2 is mathematically equivalent to
the statement that there exist no curvature extrema between k1
and k2 (assuming curvature is continuous). Since the perceptual
and mathematical importance of curvature extrema dictated the
placement of knots at them, it is crucial that the spline
introduce no curvature extrema not present in the data. In
recognition of the importance of this characteristic, we refer to these
splines as monotone curvature splines. The current
implementation relies on the relationship between evolutes and involutes,
a construct from classical differential geometry. The spline
itself is the involute, determined by a trivial calculation from the
evolute; finding the evolute is the computationally hard part.
The evolute is not determined uniquely from the input data. We
have chosen to use four circular arcs, primarily for the sake of
computational efficiency; the resulting evolute is continuous, and
continuous in tangent, but not in curvature. It is a fortuitous
aspect of the relationship between evolutes and involutes that
the involute's curvature is continuous. An iterative procedure is
used to find the evolute; it converges in test cases satisfactorily
and rapidly, although more testing needs to be done.
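
The evolute-involute relationship can be checked numerically on the simplest case, in which the evolute is a single full circle of radius a rather than the four circular arcs used in the implementation. This is only an illustration under that simplifying assumption. The radius of curvature of the involute equals the arc length a*t unwound from the circle, so the curvature is 1/(a*t), which is continuous and monotone. The function names and the finite-difference step are choices made for this sketch.

```python
import math


def involute_point(a, t):
    """Point on the involute of a circle of radius a, at unwinding angle t."""
    return (a * (math.cos(t) + t * math.sin(t)),
            a * (math.sin(t) - t * math.cos(t)))


def numeric_curvature(a, t, h=1e-3):
    """Curvature of the involute at t, from central finite differences."""
    x0, y0 = involute_point(a, t - h)
    x1, y1 = involute_point(a, t)
    x2, y2 = involute_point(a, t + h)
    dx, dy = (x2 - x0) / (2 * h), (y2 - y0) / (2 * h)
    ddx, ddy = (x2 - 2 * x1 + x0) / h ** 2, (y2 - 2 * y1 + y0) / h ** 2
    return (dx * ddy - dy * ddx) / (dx * dx + dy * dy) ** 1.5


a = 2.0
angles = [1.0, 1.5, 2.0, 2.5, 3.0]
curvatures = [numeric_curvature(a, t) for t in angles]
expected = [1.0 / (a * t) for t in angles]
```

Because arc length along the evolute is continuous wherever the evolute is, the involute's radius of curvature is continuous even when the evolute's own curvature jumps between arcs, which is the fortuitous property noted above.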

Prediction  of  Generalized  Cylinders 


This is the first stage of a system for manipulating generalized
cylinder models. It includes a generalization of the formulation
of [Shafer 83] for the prediction of the terminator for generalized
cylinders. The algorithm can be divided into two parts: first
(A), solution for the visible parts of the contour-generators, and
secondly (B), region growing to get visible surfaces. The first part
is the principal one. It has two subparts, which are repeated, and
together find one contour-generator. Each contour-generator is a
closed loop, intersecting no others, which divides the surface into
forward (visible if unoccluded) and backward facing (invisible)
areas. The square root of the size of each visible area is a
measure of the length scale over which things are happening in
that region of the GC.

The first subpart (A1) steps over the GC with step length
proportional to the square root of the area of the region it is
contained by, until either the whole surface has been covered,
in which case the algorithm stops, or a step containing a new
contour-generator is found. In this case, the step is then bisected
down to an exact solution. A test to see whether each step
jumps a new contour-generator can be made since, whenever
the direction (forward or back) that a surface point is facing
differs from the direction predicted by the regions of the existing
contour-generators, there must be an undiscovered contour-generator
passing nearby. This means that, if one stepping point
has the same predicted and actual surface direction, and the
next does not, then a new contour-generator passes through the
intervening step. This interval is reduced, using bisection, with
the condition that one end of the interval must have the same
predicted and real surface direction, while the other end must
not.
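
The bracketing test used in subpart A1 amounts to bisection on a yes/no predicate. The following is a minimal sketch, assuming a hypothetical predicate `mismatch(t)` that reports whether the actual facing direction at a one-dimensional step parameter `t` disagrees with the predicted one. The function names, the parameterization, and the toy circle example are illustrative assumptions and are not taken from the system described here.

```python
import math


def bisect_crossing(mismatch, a, b, tol=1e-9):
    """Shrink [a, b] around a point where `mismatch` changes value.

    `mismatch(t)` is a hypothetical predicate: True when the actual facing
    direction at parameter t differs from the predicted facing direction.
    The two ends of the interval must disagree, as required in the text.
    """
    at_a, at_b = mismatch(a), mismatch(b)
    if at_a == at_b:
        raise ValueError("interval does not bracket a contour-generator")
    while abs(b - a) > tol:
        mid = 0.5 * (a + b)
        if mismatch(mid) == at_a:
            a = mid   # the change still lies between mid and b
        else:
            b = mid   # the change lies between a and mid
    return 0.5 * (a + b)


def toy_mismatch(t):
    # Toy case: a unit circle viewed along the x axis. The normal at angle t
    # is (cos t, sin t); with no contour-generators known yet, every point
    # is predicted to face forward, so a mismatch occurs where cos t <= 0.
    return math.cos(t) <= 0.0


crossing = bisect_crossing(toy_mismatch, 0.0, 2.0)
```

In this toy case the bracketed point is the angle at which the normal turns away from the viewer. On a real surface the same loop would run along the step between two stepping points.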

The solution is handed over to the second subpart (A2), which
propagates it around the whole contour-generator, back to its
start, making a list of the solutions as it goes, and noting the
ones where the contour-generator becomes visible or occluded. It
works by stepping along the contour-generator tangent, to get
a guess for the next solution point, which is Newton-Raphson
iterated to a sufficient accuracy. If the Newton-Raphson does
not converge, several points around a small circle are tested
to find an interval to bisect down to the next solution. The
step length is taken proportional to curvature of the contour, to
get uniform interpolation accuracy between the known contour
points. Each step is projected to the image, and checked for
intersection with those previously projected steps which have
not been shown to be hidden. When an intersection is found, the
exact positions of the occluding and occluded contour-generator
points are calculated. Finally the whole contour is checked
against possible surrounding contours.
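
The tangent step followed by Newton-Raphson refinement in subpart A2 is a predictor-corrector scheme. The following is a minimal sketch for a planar implicit curve f(x, y) = 0, which stands in for the contour-generator condition on a surface parameter plane. The fixed step length, the function names, and the circle example are assumptions made for illustration. The fallback search around a small circle and the curvature-dependent step length described above are not implemented; the sketch simply raises an error if the corrector fails.

```python
import math


def trace_curve(f, grad, start, step=0.05, n_steps=200, tol=1e-10):
    """Follow f(x, y) = 0 from `start` by tangent steps plus Newton correction."""
    x, y = start
    points = [(x, y)]
    for _ in range(n_steps):
        gx, gy = grad(x, y)
        norm = math.hypot(gx, gy)
        # Predictor: the tangent is the gradient rotated by 90 degrees.
        x, y = x - step * gy / norm, y + step * gx / norm
        # Corrector: Newton-Raphson along the local gradient direction.
        for _ in range(20):
            value = f(x, y)
            if abs(value) < tol:
                break
            gx, gy = grad(x, y)
            g2 = gx * gx + gy * gy
            x, y = x - value * gx / g2, y - value * gy / g2
        else:
            raise RuntimeError("Newton-Raphson did not converge")
        points.append((x, y))
    return points


def circle(x, y):
    return x * x + y * y - 1.0


def circle_grad(x, y):
    return 2.0 * x, 2.0 * y


traced = trace_curve(circle, circle_grad, (1.0, 0.0))
```

Each returned point lies on the curve to within `tol`, so the list plays the role of the list of solutions accumulated while propagating around one contour-generator.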


To convert to a form implementable in parallel, step A1 is done
independently at different places, and then A2 is used to form
contour-generator segments, which can be simply joined up into
the complete contour lists. Either way, each list of contour points
is now followed down, keeping count of the marked occluded
points, to produce lists of just the visible ones.

Automatic Generation of Special Purpose Vision Programs

Chris  Goad's  work  concerns  the  automatic  generation  of  special 
purpose  vision  programs. 

In many practical applications of automated vision, the vision
task takes the form of recognizing and locating a particular three
dimensional object in a digitized image. The exact shape of the
object to be perceived is known in advance; the purpose of the
act of perception is only to determine its position and orientation
relative to the viewer. This is model based vision in its strict
form. Most industrial applications of vision have this property,
and also the property that the same object (or, more precisely,
objects of the same shape) must be located in many images.

Goad’s work concerns a scheme for exploiting this kind of situation
which involves automatically constructing special purpose
vision programs. The starting point for the automatic construction
process is a description of a particular 3D object. The result
is a special purpose program for recognizing and locating that
object in images, without restriction on the orientation of the
object in space. Since this special purpose program has a comparatively
limited task to perform, it can be much faster than
any general purpose vision program would be. Thus each object
model is analyzed in advance, and then “compiled” into
an efficient program for detecting that object in images. The
method has been implemented and tested on a variety of images
with good results. Some of the tests involved images in which the
target objects appear in a jumbled pile. The current implementation
is not fully optimized for speed. However, evidence is
given that image analysis times on the order of a second or less
can be obtained for typical industrial recognition tasks. (This
time estimate excludes edge finding.)

Mathematical  Analysis 

from the camera transform data of [Gennery 77]. To perform the
interpolation

[Ketonen 83] has implemented a formal representation of geometry
in the EKL system. He has demonstrated
that some of the coincidence assumptions stated in [Binford 81]
can actually be proved in a suitable formal framework.

It follows from his analysis that many of the “impossible” pictures
of Huffman in [2] can be detected by simpler and more
general means than the ones used by Huffman, Clowes or Waltz.
Given that these methods are simpler (even if not complete),
they may be closer to the process actually used by the human
visual system.

One should not expect formalisations of theories to have tangible
connections with successful implementations of algorithms;
Artificial Intelligence programs need not be based on the
paradigm of theorem proving. However, the clarification of the
formal concepts underlying these systems can be of great importance
in terms of program architecture and further development.

In Blicher's work, a unifying abstract mathematical structure
is presented for a number of vision problems, notably stereo,
motion stereo, optic flow, and matching. The structure is summarized
in Fig. (*) of Blicher’s paper; he defines the various
ABSTRACT

This  report  summarizes  the  work  done  during 
the  final  two  years  of  Contract  DAAG53-76-C-0138, 
"Understanding  Objects,  Features,  and  Backgrounds." 
It  also  outlines  plans  for  work  to  be  conducted 
during  the  coming  three  years  on  Contract  DAAK70- 
83-K-0018,  "Autonomous  Vehicle  Navigation." 


1. UNDERSTANDING OBJECTS, FEATURES, AND BACKGROUNDS

1.1.  Introduction 

In  June  1976  the  U.S.  Army  Night  Vision  and 
Electro-Optics  Laboratory  awarded  Contract  DAAG- 
53-76-C-0138  to  the  University  of  Maryland  for 
research  on  "Algorithms  and  Hardware  Technology 
for  Image  Recognition."  Funding  for  this  con¬ 
tract  was  derived  primarily  from  the  Defense 
Advanced  Research  Projects  Agency  under  DARPA 
Order  3206.  During  the  following  21-month  period, 
the  University  developed  and  tested  advanced  al¬ 
gorithms  for  detection  of  tactical  targets  on 
Forward-Looking  InfraRed  (FLIR)  imagery.  Concur¬ 
rently,  on  a  subcontract,  the  Westinghouse  Defense 
Systems  Division  designed  charge-coupled  device 
(CCD) layouts for implementing many of these algorithms
in hardware, and also fabricated a CCD chip
that implemented one basic algorithm, histogramming/sorting.
The results of the work done during
the first 21 months of the contract are documented
in detail in a Final Report dated March 1978 [A1].

In  April  1978  the  contract  was  extended  for  a 
two-year  period,  under  the  new  title  "Image  Under¬ 
standing  Using  Overlays."  During  this  phase  of 
the  project,  numerous  algorithms  were  developed 
and  tested  for  object  detection  and  extraction  from 
images,  as  well  as  for  image  and  region  represen¬ 
tation.  On  a  subcontract,  Westinghouse  investi¬ 
gated  the  implementation  of  some  of  these  algori¬ 
thms  in  general-  or  special-purpose  digital  hard¬ 
ware.  Westinghouse  also  conducted  tests  of  one 
class of algorithms known as "relaxation" techniques.
The results of the work done during this
period are documented in a series of technical and
semiannual reports, and are summarized in a Final
Report dated May 1980 [A2].


In May 1980 the contract was extended for a
final two-year period (later extended, at no additional
cost, through December 1982), under the
title "Understanding Objects, Features, and Backgrounds."
During this phase of the project, further
studies were conducted, in collaboration with
Westinghouse, on object segmentation and recognition,
feature extraction and background analysis,
multi-resolution image processing techniques, and
analysis of time varying imagery. This work was
documented in a series of project status reports
[B1-3] and Technical Reports [C1-32], and is summarized
in this Final Report.

Principal  Investigators  on  this  project  at 
the  University  of  Maryland  were  Profs.  Azriel 
Rosenfeld  and  Larry  S.  Davis,  and  at  Westinghouse, 
Dr.  Glenn  E.  Tisdale  and  Mr.  Bruce  J.  Schachter. 

The  project  monitor  at  NVEOL  is  Dr.  George  R.  Jones. 

1.2.  Object  segmentation  and  recognition 

a)  Comparative  segmentation  study 

A  comparative  study  of  object  extraction 
techniques  applicable  to  FLIR  imagery  was 
conducted  jointly  by  Maryland  and  Westing¬ 
house,  using  a  database  of  52  images  collected 
by  Westinghouse  from  Army,  Navy,  and  Air  Force 
sources.  Techniques  tested  by  Maryland  in¬ 
cluded  two  variations  of  a  relaxation  method 
as  well  as  new  methods  based  on  multiresolu¬ 
tion  image  representations,  known  as  "pyra¬ 
mids."  One  of  the  pyramid-based  methods  out¬ 
performed  all  the  other  techniques  tested. 

The results of the Maryland study are documented
in detail in a technical report [C19],
while Westinghouse's study is documented in
a Westinghouse report.

b)  New  segmentation  techniques 

As  a  supplement  to  the  main  segmentation 
study,  several  new  segmentation  techniques 
were  developed  under  the  project.  Two  methods 
developed  on  earlier  projects  were  extended 
from  single-band  to  multiband  imagery.  One 
of these improves the detectability of clusters
in a histogram or scatterplot by suppressing
pixels that lie on edges [C1]. The
other,  known  as  "Superspike,"  converts  the 
peaks  in  a  histogram  or  scatterplot  into  sharp 
spaces and mappings present in performing matching, and their
relationships and properties. This is done in a fairly abstract
way, so as to be applicable to many different types of vision
problem. For example, the same formalism describes perspective
as well as orthogonal projection, unusual camera geometries, and
projection onto a plane or a sphere, etc. Blicher believes that this
type of language can eventually be converted into a computer
language for describing a computational environment for vision.

Ideas from modern differential topology are presented and applied
to the general matching problem, a common approach to
stereo matching, defined as follows. Given 2 picture functions
$F_1, F_2 : \mathbb{R}^2 \to \mathbb{R}^n$, one finds regions
$K_1, K_2 \subset \mathbb{R}^2$ and a 1-1 matching function
$\phi : K_1 \to K_2$ such that $F_1 = F_2 \circ \phi$ on $K_1$. Blicher
proves a “2-color theorem”: that generically for monochrome
pictures ($n = 1$) there is a large infinity of solutions, but for 2 or
more colors ($n \geq 2$) the solution is unique. In the monochrome
case, the solutions can be quite different, matching the same
point to widely separated target points. Though the theorem
literally deals with matching grey levels, it is equally valid for
a derived function, such as the output of a lateral inhibition
operator, a smoother, or an edge filter, although only areas lacking
occlusion are considered.

“Generic”  is  a  central  concept  in  differential  topology,  which 
means  “almost  always”  in  a  precise  way,  allowing  one  to  exclude 
pathological  or  unlikely  behaviors  which  cannot  be  encountered 
in  practice,  thus  making  many  problems  tractable.  This  can  find 
application  as  well  in  inferring  structure  from  images. 

The proof of the theorem for the monochrome case is based
on a very simple intuitive argument involving sliding iso-brightness
contours along themselves. Independently of proving
the theorem, to flesh out the intuition, Blicher presents some
facts about how such contour lines can look. Although no use
is made of it in the paper, such information in itself is useful
for matching, as topological structure is invariant for small
perturbations, hence it is important to classify the possibilities
into a small discrete set. Also, this theory is useful for understanding
any real-valued function on a picture, for example the
zero-crossings of an edge finder, or the values of some curvature
parameter, say Gaussian curvature, or even some local Fourier
coefficient, as one might use for a texture system.

Equation-free sentence:

Ideas from modern differential topology are presented and applied
to the general matching problem, a common approach to
stereo matching, defined as follows.

Given 2 picture functions, one finds 2 regions and a 1-1
matching function between them such that the matching
function preserves grey level values.
spikes (thus making them trivially detectable)
by a process of iterated local averaging in
which the histogram is used as a guide in
selecting those neighbors with which a given
pixel should be averaged [C26]. A third
method, "bimean clustering," identifies the
two "best" subpopulations in a histogram by
finding the pair of values that gives a best
fit to the histogram in the least squares
sense [C20].
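
One reading of the least squares criterion for bimean clustering is an exhaustive search over split points of the histogram, keeping the split whose two subpopulation means give the smallest total squared error. The sketch below follows that reading; it is an illustrative assumption and not the procedure documented in [C20]. The function name and the sample histogram are hypothetical.

```python
def bimean(hist):
    """Return (m1, m2, split) minimizing the squared error of a two-value fit.

    hist[g] is the number of pixels with gray level g. Levels below `split`
    are represented by m1 and the remaining levels by m2.
    """
    levels = range(len(hist))
    best = None
    for split in range(1, len(hist)):
        low_count = sum(hist[:split])
        high_count = sum(hist[split:])
        if low_count == 0 or high_count == 0:
            continue  # one side is empty, so this is not a two-value fit
        m1 = sum(g * hist[g] for g in range(split)) / low_count
        m2 = sum(g * hist[g] for g in range(split, len(hist))) / high_count
        error = sum(hist[g] * (g - (m1 if g < split else m2)) ** 2
                    for g in levels)
        if best is None or error < best[0]:
            best = (error, m1, m2, split)
    if best is None:
        raise ValueError("histogram needs at least two occupied gray levels")
    return best[1], best[2], best[3]


# Hypothetical histogram with one population near level 2 and one near level 8.
sample_hist = [0, 5, 10, 5, 0, 0, 0, 4, 8, 4]
low_mean, high_mean, split_level = bimean(sample_hist)
```

The search is quadratic in the number of gray levels, which is adequate for a sketch; running sums would make it linear.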

c) Object identification using constraint filtering

The  conventional  approach  to  recognizing 
targets  in  FLIR  imagery  is  to  extract  poten¬ 
tial  target  regions  using  segmentation  tech¬ 
niques,  and  then  carefully  analyze  the  proper¬ 
ties  of  each  region  independently  in  order  to 
determine  whether  or  not  it  could  be  a  target. 
We  have  investigated  a  complementary  approach 
based  on  comparisons  among  regions  rather 
than  analysis  of  individual  regions.  After 
the  image  is  segmented,  we  give  each  region 
a  set  of  possible  labels  -  e.g.,  "sky," 
"ground,"  "smoke,"  "tree,"  "tank."  We  then 
attempt  to  eliminate  labels  from  the  regions 
based  on  their  relationships  with  other  re¬ 
gions  (relative  property  values,  relative 
positions, etc.). This method performed
successfully in a small set of tests; it eliminated
the "tank" label from all the non-tank
regions but kept it for all the tank
regions [C25]. This approach should be of
interest as a supplement to existing target
recognition algorithms.
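
The label elimination step can be sketched as a discrete filtering loop: a label is dropped from a region when some related region has no remaining label compatible with it, and the loop repeats until nothing changes. The data structures, the relation names, and the toy compatibility rule below are hypothetical and chosen only for illustration; the constraints actually used are those of [C25].

```python
def filter_labels(candidates, neighbors, compatible):
    """Discard labels that some related region cannot support.

    candidates: dict mapping region -> set of possible labels
    neighbors:  dict mapping region -> list of (other_region, relation)
    compatible: function (label, relation, other_label) -> bool
    """
    labels = {region: set(ls) for region, ls in candidates.items()}
    changed = True
    while changed:
        changed = False
        for region, related in neighbors.items():
            for label in list(labels[region]):
                for other, relation in related:
                    supported = any(compatible(label, relation, other_label)
                                    for other_label in labels[other])
                    if not supported:
                        labels[region].discard(label)
                        changed = True
                        break
    return labels


def toy_rule(label, relation, other_label):
    # Toy constraint: sky is never below another region,
    # and no region lies above sky.
    if relation == "below" and label == "sky":
        return False
    if relation == "above" and other_label == "sky":
        return False
    return True


toy_candidates = {"top": {"sky", "tank"}, "bottom": {"sky", "ground"}}
toy_neighbors = {"top": [("bottom", "above")], "bottom": [("top", "below")]}
filtered = filter_labels(toy_candidates, toy_neighbors, toy_rule)
```

In this toy case the "sky" label is removed from the bottom region, while both labels survive on the top region because nothing in the rule set contradicts them.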

1.3. Feature extraction and background analysis

a)  Edge  and  corner  extraction 

Feature detection (e.g., edge detection)
is an important adjunct to object recognition,
and also plays an important role in image
matching (e.g., for object tracking and time-varying
imagery analysis). Three feature
detection studies were conducted on this project.
The optimal approach to edge detection
developed by Hueckel, which finds the best-fitting
step function to a given image neighborhood,
was applied to derive optimal edge
operators for a class of small neighborhoods
[C28]. A basic new method of evaluating edge
detector output, based on consistency of the
edge output data, was developed and successfully
tested [C8]. A simplified method of
corner detection was developed based on detecting
discontinuities in one-dimensional
projections of the image; this method eliminates
the need to apply computationally expensive
higher-order derivative operators at
every point of the image [C13].

b) Blob and ribbon extraction

Work was also done on the detection of
higher-level features such as "blobs" and
"ribbons" in an image. (A blob is surrounded
by consistently facing edges, while a ribbon
is characterized by "antiparallel," oppositely
facing edges.) Edge linking schemes were
developed for detecting such features based
on compatibility of the edges with respect to
both geometry and gray level [C2]. Quantitative
measures for edge compatibility were also
developed for assessing both closedness [C3]
and antiparallelness [C4].

c)  Texture  analysis 

In  connection  with  image  background  charac¬ 
terization,  two  texture  analysis  studies  were 
conducted.  An  approach  to  texture  analysis 
based  on  average  strength  of  match  with  vari¬ 
ous  local  patterns  was  implemented;  it  was 
found to perform better than several standard
methods [C18]. The idea of applying texture
measures to arrays of terrain elevation
data was also briefly explored; if such
data were available at sufficient resolution,
it would provide a useful supplement
to intensity-based texture analysis
[C15].

1.4. Multi-resolution image analysis

a) Background and related work


A  potentially  powerful  new  approach  to 
image  analysis,  now  under  development  at  our 
laboratory,  is  based  on  the  use  of  a  "pyra¬ 
mid"  of  successively  reduced-resolution  ver¬ 
sions  of  the  given  image.  Initial  work  on 
image  segmentation  using  pyramids  was  done 
under  NSF  sponsorship.  During  the  summer 
of  1982,  a  workshop  on  "Multiresolution 
Image  Processing  and  Analysis"  was  held, 
also  under  NSF  sponsorship,  at  which  about 
25  research  groups  presented  recent  results 
that  make  use  of  multiresolution  image  rep¬ 
resentations  in  various  ways.  The  pyramid 
image  representation  also  has  the  advantage 
of  compatibility  with  the  quadtree  region 
representation,  which  was  extensively  studied 
during  an  earlier  phase  of  this  project,  and 
which  is  being  further  studied  in  connection 
with  cartographic  data  base  applications 
under  the  sponsorship  of  the  U.S.  Army  En¬ 
gineer  Topographic  Laboratory. 
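
A pyramid of successively reduced-resolution versions of an image can be sketched as repeated block averaging. The version below assumes non-overlapping 2x2 blocks and a square image whose side is a power of two. These are simplifying assumptions for illustration; the reports cited in this section may use other neighborhood schemes.

```python
def build_pyramid(image):
    """Return [full resolution, half resolution, ...] by averaging 2x2 blocks.

    `image` is a square list of lists whose side length is a power of two.
    """
    levels = [image]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        half = len(prev) // 2
        # Each pixel at the new level is the mean of its four "children".
        levels.append([
            [(prev[2 * r][2 * c] + prev[2 * r][2 * c + 1]
              + prev[2 * r + 1][2 * c] + prev[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(half)]
            for r in range(half)
        ])
    return levels


toy_image = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 4, 4],
    [4, 4, 4, 4],
]
pyramid = build_pyramid(toy_image)
```

With non-overlapping blocks, the four children of each pixel are exactly the children of the corresponding node in a quadtree, which illustrates the compatibility with the quadtree region representation noted above.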

b)  Segmentation  and  representation 

One way of using the pyramid representation
for segmentation is to define links between pixels
and their "parents" at consecutive levels of
the pyramid, based on mutual similarity; this
gives rise to subtrees of the pyramid, and
thus defines a partition of the image, where
each part consists of the pixels that are
the leaves of a given subtree. A number of
variations on this basic approach were investigated
[C6], and it was also generalized to
multispectral imagery [C11]. In connection
with quadtree region representation, earlier
work on the generation of an image row by row
from its quadtree was extended to include
several new algorithms [C7].
Pyramids can also be used to extract and
represent features such as edges and blobs
in an image. If we use mutual similarity
as a basis for linking "edgels," rather than
pixels, in a pyramid representation, we obtain
"trees" of edges which allow us to detect
the major edges in an image, at the
higher levels of the pyramid, and then locate
these edges precisely at the full-resolution
level [C14]. Pyramids can also be used to
encode edges (or curves) detected in an image,
yielding successively coarser approximations
as long as the edges crossing a given block
of the image can be compactly approximated
[C5]. These approximations can then be used
as an aid in linking together edge segments
that lie on long lines or smooth curves [C30].
Two approaches to blob extraction using pyramids
were also investigated. One of these
uses pixel linking to construct subtrees of
the pyramid such that leaves of each subtree
are the pixels that belong to a compact,
homogeneous piece of the image [C24]. Another
approach is based on the fact that any
blob shrinks to a (local) "spot" at some
level of the pyramid; it detects blobs by
constructing an edge pyramid and detecting
pixels that are locally surrounded by edges
[C21]. This method outperformed all the
others that were tested in the comparative
study of FLIR image segmentation techniques
(see above, Section 1.2a).


1.5. Time-varying imagery analysis


a)  Image  matching 


One  approach  to  detecting  and  analyzing 
motion  in  an  image  sequence  is  to  identify 
sets  of  corresponding  points  in  successive 
frames  of  the  sequence.  This  is  usually 
done  by  searching  for  matches  to  pieces  of 
one  frame  in  the  other  frame.  In  order  to 
obtain  sharp  matches,  it  is  desirable  to 
use  pieces  that  contain  distinctive,  high- 
contrast  features  such  as  corners  (they 
are  preferable  to  edges  because  the  match 
to  an  edge  is  insensitive  to  displacement 
in the direction along the edge). Some successful
experiments in image matching using
corner features are described in [C12]. A
supplemental  experiment,  reported  in  [C17], 
showed  that  local  intensity-based  matching 
in  the  neighborhood  of  a  feature  point  can 
be  used  to  unambiguously  locate  match  peaks 
in  those  cases  where  the  results  of  the 
feature  matching  are  ambiguous. 
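
Searching for a match to a piece of one frame in the other frame can be sketched as an exhaustive search over small displacements, scoring each candidate by the sum of squared intensity differences. The scoring function, the function names, and the toy frames below are assumptions made for illustration; the experiments reported in [C12] and [C17] may use a different match measure.

```python
def best_match(frame_a, frame_b, top, left, size, radius):
    """Find the displacement (dr, dc) of a square patch of frame_a in frame_b.

    The patch has its upper-left corner at (top, left) in frame_a. Candidate
    displacements up to `radius` pixels are scored by sum of squared differences.
    """
    rows, cols = len(frame_b), len(frame_b[0])
    best = None
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r0, c0 = top + dr, left + dc
            if r0 < 0 or c0 < 0 or r0 + size > rows or c0 + size > cols:
                continue  # candidate window falls outside frame_b
            ssd = sum((frame_a[top + i][left + j] - frame_b[r0 + i][c0 + j]) ** 2
                      for i in range(size) for j in range(size))
            if best is None or ssd < best[0]:
                best = (ssd, dr, dc)
    if best is None:
        raise ValueError("no candidate window fits inside frame_b")
    return best[1], best[2]


def corner_frame(r, c, side=8):
    # Toy frame: a small corner-like pattern on a dark background.
    frame = [[0] * side for _ in range(side)]
    frame[r][c], frame[r][c + 1], frame[r + 1][c] = 9, 5, 5
    return frame


shift = best_match(corner_frame(2, 2), corner_frame(3, 4), 1, 1, 4, 2)
```

Because the toy pattern varies in both directions, only one displacement gives a zero score. A patch containing a straight edge alone would score equally well at every displacement along the edge, which is the ambiguity described in the text.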


b) Motion estimation and smoothing

Another approach to motion detection, appropriate
in cases where the rate of motion
does not exceed one pixel per frame, involves
using the space and time derivatives
of the image intensity at each pixel to
estimate a motion vector at that pixel. This
method yields reliable estimates of motion
components only in directions where there
are rapid changes in gray level. Thus in
a smooth region it yields no useful information;
at an edge it yields only the component
of motion in the direction across the
edge; but at corner pixels it yields two
components, thus allowing the entire motion
vector to be estimated [C16]. Given a region
in the image representing a rigid object
moving parallel to the image plane, we can
estimate motion vectors at the corners of
the object and "propagate" these estimates
around the edges of the object to determine
its motion (translation and rotation).
This approach to motion estimation was developed
in a series of reports [C22, C23, C29].

The motion vector fields obtained from
small image neighborhoods are noisy. If
they are smoothed by simple local averaging,
incorrect results are obtained at the
boundaries of moving objects. A better
approach is to use nonlinear smoothing
techniques based on selective local averaging;
this does not blur sharp edges
[C31]. A related problem is that of
smoothing the images in a sequence by
averaging successive frames; here one cannot
simply average corresponding pixels,
but must introduce displacements in order
to allow for the motion. In this connection,
one need not know the entire motion
vector, but only its component in the gradient
direction, since errors in the tangential
direction will not cause edges
to become blurred [C32].

The changes in an image sequence due to
the motion of the observer relative to the
scene, rather than to object motion, are
known as "optical flow." Given an array of
motion vectors representing optical flow,
methods have been developed of inferring
the parameters of the observer's motion
(translational and rotational) and of deriving
the relative distances between the
observer and the points in the scene. Algorithms
for deriving relative scene distance
and local surface orientation from
optical flow are presented in [C9], while
a method of deriving the observer's instantaneous
direction of motion from optical
flow, and of decomposing his motion
into translational and rotational components,
is developed in [C10].

1.6. Status reports

As mentioned in the Introduction, three project
status reports were issued [B1-3] summarizing
the work done during this phase of the project.
The first and third of these reports were also
published in the Proceedings of the two DARPA
Image Understanding Workshops that were held during
this period (April 1981 and September 1982).
At a meeting of the principal investigators
on  the  DARPA  Image  Understanding  Program,  held 
in  January  1982,  it  was  decided  to  prepare  a  Final 
Report  on  the  overall  program.  The  University  of 
Maryland  was  asked  to  draft  the  portion  of  this 
report  dealing  with  two-dimensional  image  analysis 
techniques  ("low-level  vision").  An  edited  ver¬ 
sion  of  this  draft  was  also  issued  as  a  technical 
report  [C27]. 


2. Semi-annual report (1 February - 31 July
1981), August 1981.

3. Project status report (1 August 1981 - 31
July 1982), July 1982.

C. Technical Reports on the current phase of the
project

2.  AUTONOMOUS  VEHICLE  NAVIGATION 

This  project  is  concerned  with  developing 
navigation  techniques  for  an  autonomous  outdoor 
ground  vehicle.  The  vehicle  will  have  access  to 
a  stored  database  containing  information  about  the 
terrain  on  which  it  is  to  operate,  and  will  have 
sensory  input  from  a  passive  optical  or  IR  sensor. 
The  key  problem  in  navigating  the  vehicle  is  to 
relate  the  sensory  input  to  the  stored  data  in 
order  to  determine  the  location  of  the  vehicle  and 
the  locations  of  landmarks  or  goals,  and  to  plan 
paths  (from  the  current  location  to  a  goal)  that 
satisfy  given  constraints.  Additional  tasks,  on 
which  preliminary  work  will  also  be  done,  relate 
to  short-range  sensing  (e.g.,  for  obstacle  avoid¬ 
ance)  and  to  real-time  analysis  of  time-varying 
imagery. 

Since  this  project  was  initiated  quite  re¬ 
cently,  this  report  provides  only  a  general  out¬ 
line  of  the  planned  tasks.  A  vehicle  and  a  test 
site  have  been  tentatively  selected.  Westinghouse 
will gather data regarding the site (e.g., high-resolution
terrain model and sample imagery) and
will  also  design  and  assemble  the  vehicle  system. 
Maryland  will  develop  algorithms  for  processing  the 
imagery, relating it to the stored data, and planning
paths for the vehicle. When these algorithms
have  achieved  adequate  performance,  Westinghouse 
will  adapt  them  to  run  on  the  vehicle's  on-board 
computer,  after  which  they  will  be  tested  under 
real-world  conditions.  Concurrently,  Maryland 
will  continue  to  study  problems  related  to  short- 
range  sensing  and  real-time  processing.  Maryland's 
work  during  the  initial  months  of  the  project  has 
dealt  primarily  with  time-varying  imagery  analy¬ 
sis;  a  paper  reporting  on  one  aspect  of  this  work 
appears  elsewhere  in  these  Proceedings. 
The major focus of our DARPA funded research
program revolves around issues of dynamic image
processing. We have been examining techniques for
recovery of environmental information, such as
depth maps of the visible surfaces, from a sequence
of images produced by a sensor in motion.

Algorithms that appear robust have been developed
for constrained sensor motion such as pure
translation, pure rotation, and motion constrained
to a plane. Interesting algorithms with promising
preliminary experimental results have also been
developed for the case of general sensor motion in
images where there are several significant depth
discontinuities, and for scenes with multiple
independently moving objects. A general
hierarchical parallel algorithm for efficient
feature matching has also been developed for
applications of motion, stereo, and image
registration.
In addition, we have been designing a highly
parallel architecture that integrates aspects of
both parallel array processing and associative
memories for real-time implementation of motion
algorithms. Finally, there has been a continuation
of the VISIONS static image interpretation project,
with interesting results in top-down processing of
a set of complex outdoor house scenes. Each of the
above research topics is documented in papers
appearing in these proceedings [1-6].


I.  QUANTITATIVE  MOTION  PROCESSING  FOR 
RECOVERY  OF  ENVIRONMENTAL  DEPTH 

1.1.  INTRODUCTION 


The major goal in motion processing is the
recovery of the motion parameters of the sensor and
each independently moving object. The computation
of environmental depth of visible surfaces follows
in a rather straightforward manner. This has
generally involved two stages of processing:
computation of a feature displacement field,
followed by inference of motion parameters and
environmental depth. We will present several
algorithms for performing this computation in
independent stages, and in several restricted cases
of sensor motion some new alternatives for
combining the two stages in a robust manner.

The set of image displacements from two or
more images is an approximation to optic flow.
During this stage of the processing one faces the
well-known correspondence problem, which involves
the matching of corresponding image points of an
environmental feature in the pair of images. The
second stage involves inference of environmental
information from the optic flow or the displacement
field. This becomes a problem of separating the
translational and rotational components of the flow
field.

Rotation of the sensor induces image
displacements that are a function only of the
rotational parameters and image position; in
particular the feature displacement between images
is not a function of the depth of its environmental
surface point.

The translational motion of the sensor carries
all of the environmental cues. For purely
translational motion, the image displacement paths
are determined by radial flow lines emanating from
a single point in the image plane, that is, the
intersection of the translational axis with the
image plane (also referred to as the focus of
expansion - FOE). The size of displacements along
these paths is a function of environmental depth
and distance from the FOE. Thus, the problem of
general motion becomes one of decomposing the
rotational and translational effects of motion, and
then using the image displacements from the
instantaneous component of translational motion to
compute depth.


1.2.  RESTRICTED  CASES  OF  SENSOR  MOTION 

Our  primary  technique  for  depth  inference  has 
been  derived  in  Lawton's  forthcoming  doctoral 
dissertation [7]. He has shown that in the cases
of  restricted  sensor  motion  -  pure  translation, 
pure  rotation,  and  motion  constrained  to  a  plane  - 
one  can  bypass,  or  at  least  simplify,  the 
correspondence  problem  by  combining  the  computation 
of  the  motion  parameters  with  the  determination  of 
image displacements.

Let us illustrate with the case of pure
translational motion [3]. There are two unknown
sensor parameters, which can be specified by the
intersection of the translation axis with the image
plane (the FOE). For a given FOE, the flow lines
emanate radially from this point, and therefore the
matching of an image point in one frame to its new
position in the second frame has been reduced to a
one-dimensional search along the straight line
through the FOE and the image point. While there
may still be spurious high correlations possible,
the number of incorrect good matches will be
greatly reduced over the usual two-dimensional
correlation process. For an incorrect FOE
there is a strong probability that many points will
have poor correlations at all points along the
hypothesized displacement path. The shape of the
resulting error function can be improved by
selection of "interesting" image points of high
contrast (boundaries) and high curvature (corners).
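
The one-dimensional search follows from simple projective geometry. The derivation below is a sketch under assumed notation that does not appear in the original text: a pinhole camera with focal length $f$, a sensor translation $T = (T_x, T_y, T_z)$ with $T_z \neq 0$ and no rotation, and a static scene point $(X, Y, Z)$ in the first camera frame that projects to $p$ before the motion and to $p'$ after it.

```latex
\begin{align*}
p  &= \frac{f}{Z}\,(X,\; Y), \qquad
p' = \frac{f}{Z - T_z}\,(X - T_x,\; Y - T_y), \qquad
e  = \frac{f}{T_z}\,(T_x,\; T_y) \quad \text{(the FOE)} \\[4pt]
p' - p &= \frac{T_z}{Z - T_z}\,(p - e) \\[4pt]
\frac{Z}{T_z} &= 1 + \frac{\lVert p - e \rVert}{\lVert p' - p \rVert}
\qquad (Z > T_z > 0)
\end{align*}
```

The second line says that every displacement lies on the line through the FOE and the image point, which is what reduces matching to a one-dimensional search. The third line shows that depth is recovered only relative to the amount of forward translation $T_z$.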

The  determination  of  the  translational  motion 
parameters  has  now  become  a  search  process  using  a 
global  error  measure  which  is  the  sum  of  the  errors 
of  the  best  match  on  each  point's  flow  path.  The 
search  process  consists  of  two  phases:  a  global 
sampling  of  the  error  measure,  and  then  a  local 
search  at  a  finer  sampling  to  determine  the 
minimum.  The  error  function  appears  to  be  very 
well  behaved  in  a  series  of  experiments  on  real 
scenes,  and  the  algorithm  seems  rather  robust. 
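
The two-phase search can be sketched on a toy example. This is a minimal illustration and not Lawton's implementation: the function names, the 3x3 sum-of-squared-differences match, the ten sampled displacements per flow path, and the synthetic images (unique pixel values in the first frame, a blank second frame with three displaced patches) are all assumptions made so that the example is small and deterministic.

```python
def patch_ssd(img1, p, img2, q, r=1):
    """Sum of squared differences of two (2r+1)x(2r+1) patches, or None if off-image."""
    h, w = len(img1), len(img1[0])
    total = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            x1, y1, x2, y2 = p[0] + dx, p[1] + dy, q[0] + dx, q[1] + dy
            if not (0 <= x1 < w and 0 <= y1 < h and 0 <= x2 < w and 0 <= y2 < h):
                return None
            total += (img1[y1][x1] - img2[y2][x2]) ** 2
    return total


def best_match_on_flow_path(img1, img2, p, foe, steps=10):
    """1-D search: try displacements (k/steps)*(p - foe), k = 0..steps, on the radial line."""
    vx, vy = p[0] - foe[0], p[1] - foe[1]
    errors = []
    for k in range(steps + 1):
        q = (p[0] + round(k * vx / steps), p[1] + round(k * vy / steps))
        err = patch_ssd(img1, p, img2, q)
        if err is not None:
            errors.append(err)
    return min(errors)


def global_error(img1, img2, features, foe):
    """Sum over all feature points of the best match error on each point's flow path."""
    return sum(best_match_on_flow_path(img1, img2, p, foe) for p in features)


def search_foe(img1, img2, features, coarse=10, fine_radius=4):
    h, w = len(img1), len(img1[0])
    score = lambda foe: global_error(img1, img2, features, foe)
    # Phase 1: global sampling of the error measure on a coarse grid of candidate FOEs.
    grid = [(x, y) for y in range(0, h, coarse) for x in range(0, w, coarse)]
    best = min(grid, key=score)
    # Phase 2: local search at single-pixel sampling around the coarse minimum.
    local = [best] + [(best[0] + dx, best[1] + dy)
                      for dy in range(-fine_radius, fine_radius + 1)
                      for dx in range(-fine_radius, fine_radius + 1)]
    best = min(local, key=score)
    return best, score(best)


# Toy data (hypothetical): three features expand away from a FOE at (20, 20).
W = H = 41
img1 = [[y * W + x for x in range(W)] for y in range(H)]   # every pixel value is unique
img2 = [[-1] * W for _ in range(H)]                        # second frame starts blank
moves = {(30, 20): (33, 20), (20, 30): (20, 35), (10, 10): (8, 8)}  # point -> new position
for (px, py), (qx, qy) in moves.items():
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            img2[qy + dy][qx + dx] = img1[py + dy][px + dx]

foe, err = search_foe(img1, img2, list(moves))
print(foe, err)
```

In this toy case the true FOE lies on the coarse grid, so the global error there is exactly zero. With real imagery the coarse phase only brackets the minimum and the fine phase does the remaining work.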

In  the  case  of  pure  rotation,  the  basic 
technique  can  be  applied  with  minor  differences. 
The  search  space  for  the  correct  rotational 
parameters  is  three-dimensional:  two  parameters 
for  the  axis  of  rotation  and  one  for  the  magnitude 
of  rotation.  The  algorithm  can  proceed  in  the  same 
manner  by  choosing  a  set  of  distinguished  points, 
and then computing a global error on a coarsely
sampled  parameter  space.  This  problem  is  actually 
slightly  more  constrained  than  the  first,  because 
here  the  third  dimension  of  the  amount  of  rotation 
will  directly  constrain  the  image  motion  of  all 
points  simultaneously,  while  in  the  translational 
case  each  point  had  to  be  matched  independently 
(because  of  differences  in  environmental  depth). 

In  the  case  of  motion  restricted  to  a  known 
plane,  there  are  only  two  degrees  of  freedom. 
Translational  motion  will  be  constrained  to  the  one 
dimension  of  the  line  represented  by  the 
intersection  of  the  known  plane  and  the  image 
plane.  The  axis  of  rotation  must  be  perpendicular 
to  the  plane,  and  therefore  we  must  only  determine 
the  degree  of  rotation. 

A set of experiments has shown these
algorithms to be very robust in real scenes,
including the outdoor road sequence from William's
thesis [10] and industrial image domains supplied
by the General Electric Corporation.


1.3. RECOVERY OF DEPTH FROM GENERAL SENSOR MOTION

As we have pointed out earlier, the flow
fields produced by a sensor undergoing general
motion are difficult to interpret until they have
been decomposed into their rotational and
translational components. Once this has taken
place, environmental depth can be recovered from
translational displacements. Analytical techniques
for performing this computation are extremely
complex and can be quite sensitive to the errors
that are typical in the computation of displacement
fields. It is not feasible to exploit the approach
of the previous cases where potential motion
parameters were tested by computing a global error
measure of lack of consistency across a set of
image features. In the previous cases the
dimensionality of the search space was no greater
than three, but here it is a five-dimensional
search space, and the computational demands may be
excessive. In addition the error function cannot
be expected to be well-behaved, so simple
optimization techniques probably would not work.

Recently  Lawton  and  Rieger  [2]  have  described 
a  surprisingly  simple  technique  that  promises  to  be 
rather  robust  in  noisy,  low  resolution  and/or 
sparse  displacement  fields.  It  depends  upon  the 
scene  containing  a  sufficient  number  of  depth 
discontinuities of sufficient depth difference.
Thus, a scene with several objects at distinct
depths, or a single object of reasonable size
against a textured background, will permit this
technique to be effective.

Consider distinct surface features at
different depths on an occlusion boundary. Sensor
rotation displaces both features equally, because
they appear at the same image location. Thus, the
only difference in their image displacement is
caused by a difference in translational
displacement. This leads to an algorithm which
exploits nearby image points that are at different
depths. Note, however, that occlusion need not be
determined because differences can be taken of all
nearby flow vectors. These difference vectors will
be oriented along radial flow lines emanating from
the instantaneous axis of translation, which can be
determined by an optimization procedure. There are
several approaches to determining the axis of
translation, such as the use of a Hough transform
to select the point that most nearly lies at the
intersection of these vectors. Due to practical
noise considerations, a global error measure is
used to evaluate each possible value for the
direction of the translational axis in a coarse to
fine search. The error measure used is the sum of
the magnitudes of the error angles between the
difference vector field and the set of radial field
lines. Once the instantaneous axis of translation
is determined, the rotational component is
overconstrained; it can be determined and then
subtracted out. Environmental depth of image points
can then be computed from the translational
displacement.
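
The first step, locating the axis of translation from difference vectors, can be sketched as follows. This is an illustration under stated assumptions rather than the Lawton and Rieger implementation: the flow model (focal length 1, one common small-rotation sign convention), the idealized pairs of features at identical image positions, the search range, and all function names are hypothetical. Recovery of the rotational component is not shown.

```python
import math


def rotational_flow(x, y, w=(0.01, -0.02, 0.03)):
    """Small-rotation image motion (focal length 1); depends on image position only."""
    wx, wy, wz = w
    return (x * y * wx - (1 + x * x) * wy + y * wz,
            (1 + y * y) * wx - x * y * wy - x * wz)


def flow(x, y, depth, foe, tz=1.0):
    """Total flow = rotational part + translational part directed along (p - FOE)."""
    ru, rv = rotational_flow(x, y)
    return (ru + tz * (x - foe[0]) / depth, rv + tz * (y - foe[1]) / depth)


def error_angle(d, r):
    """Unsigned angle in [0, pi/2] between difference vector d and radial line direction r."""
    cross = d[0] * r[1] - d[1] * r[0]
    dot = d[0] * r[0] + d[1] * r[1]
    return math.atan2(abs(cross), abs(dot))


def global_error(diffs, foe):
    """Sum of error-angle magnitudes over the whole difference vector field."""
    return sum(error_angle(d, (x - foe[0], y - foe[1])) for (x, y), d in diffs)


def search_axis(diffs, lo=-1.0, hi=1.0, coarse=0.25, levels=3):
    score = lambda foe: global_error(diffs, foe)
    n = round((hi - lo) / coarse)
    grid = [(lo + i * coarse, lo + j * coarse) for i in range(n + 1) for j in range(n + 1)]
    best = min(grid, key=score)
    step = coarse
    for _ in range(levels):                       # coarse to fine refinement
        step /= 5
        local = [best] + [(best[0] + i * step, best[1] + j * step)
                          for i in range(-4, 5) for j in range(-4, 5)]
        best = min(local, key=score)
    return best


# Synthetic test field: at each point, a near and a far feature share one image position.
true_foe = (0.25, -0.5)
points = [(-0.8, -0.6), (-0.3, 0.7), (0.5, 0.4), (0.9, -0.2), (0.1, -0.9), (-0.6, 0.1)]
diffs = []
for x, y in points:
    near, far = flow(x, y, 2.0, true_foe), flow(x, y, 8.0, true_foe)
    diffs.append(((x, y), (near[0] - far[0], near[1] - far[1])))   # rotation cancels

estimate = search_axis(diffs)
print(estimate)
```

Because each pair shares an image position, the rotational terms cancel exactly in the subtraction and the difference vectors are purely radial. With real data the paired vectors are only near each other, which is the source of the residual error discussed in the text.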




The  algorithm  is  not  quite  so  straightforward 
because  there  may  not  be  many  reliable  image 
displacement  vectors  that  are  at  different  depths 
and  near  each  other.  To  the  degree  that  they  are 
not  at  sufficiently  different  depths,  their 
difference  vector  will  be  short  and  prone  to  error. 
To  the  degree  that  they  are  not  near  each  other, 
their  rotational  components  will  differ  and 
introduce  error.  Thus,  practical  considerations  in 
the  application  of  the  algorithm  remain.  However, 
several  experiments  have  shown  very  promising 
results. 

It  should  be  noted  that  occlusion  boundaries 
of  independently  moving  objects  will  not  satisfy 
the  conditions  for  applying  this  algorithm,  and 
thus  the  next  algorithm  complements  this  work. 


1.4.  SCENES  WITH  MULTIPLE  INDEPENDENTLY  MOVING 
OBJECTS 

The  algorithms  that  we  have  just  described  do 
not  confront  the  additional  complexity  introduced 
when  there  are  multiple  independently  moving 
objects.  The  global  types  of  constraints  that  were 
described  earlier  no  longer  apply  across  the  entire 
image.  The  case  of  a  sensor  moving  through  a 
static  environment  can  be  equivalently  viewed  as  an 
image  of  a  single  rigid  object  with  associated 
motion parameters. However, if there are
independently moving objects, they will have
different motion constraints and introduce possibly
serious errors in the global search of the
parameter space for a single set of motion
parameters. Thus, the goal is to decompose the
image, and thereby separate the information in each
flow field, so that motion of each object can be
recovered.

The approach outlined here is presented by
Adiv [4]. It involves a generalized Hough
transform, and proposes solutions to some of the
problems found in this technique. Hough techniques
are relatively insensitive to noise and can deal
with partially incorrect or occluded data. Here,
such a transform will be used to group a set of
displacement vectors which satisfy the same motion
parameters. However, there is a set of problems
that must be considered: non-adjacent elements can
vote for the same image transformation, there are
difficulties in the detection of the motion
parameters of small objects, and fine resolution of
the motion parameter space can require large
amounts of memory and computation time.

The suggested solution to these problems
involves a modified multipass approach. In each
pass windows are located around potential objects
by the degree to which the displacement field is
locally inconsistent with previously found motion
transformations. The Hough transform is applied
separately to the displacement vectors in each
window. Thus, the sensitivity of the Hough
transform to local events is increased and the
motion parameters of small objects can be detected
even in a noisy displacement field. A
multiresolution scheme in both the image plane and
the parameter space reduces the computational cost,
while still maintaining accuracy.

The  algorithm  has  been  shown  to  be  efficient 
and  robust  in  extracting  motion  parameters  from 
artificial  images  with  objects  undergoing  2D 
motion.  It  involves  a  4-dimensional  parameter 
space  of  horizontal  translation,  vertical 
translation,  rotation  (in  the  image  plane)  and 
expansion/contraction.

The current research involves the extension of
this approach to 3D motion and to real scenes.
This extension is non-trivial because displacement
vectors in the 2D motion case involve four
parameters with two constraints; thus, each
displacement vector "votes" for a two-dimensional
hyperplane of the parameter space. In the case of
3D motion when surface depth is unknown, there will
be 5 motion parameters, and each displacement
vector provides only one constraint; i.e., each
will vote for a four-dimensional subspace of
parameter cells. Thus, the signal to noise ratio
in the parameter space will be much lower, and with
the presence of noise in real images, the
determination of peaks in generalized Hough space
will be challenging.
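
The voting argument for the 2D case can be made concrete with a small sketch. It is not Adiv's algorithm (there are no windows, passes, or multiresolution here). It assumes the four parameters act as p' = k R(a) p + t, with rotation angle a, expansion factor k, and translation t, so that for every sampled (a, k) cell a displacement vector fixes t completely. The function name, the quantization, and the test data are hypothetical.

```python
import math
from collections import Counter


def hough_2d_motion(vectors, angles, scales, t_step=1.0):
    """Each vector (p -> p2) votes once per (angle, scale) cell, for the translation it implies."""
    acc = Counter()
    for (x, y), (x2, y2) in vectors:
        for ai, a in enumerate(angles):
            c, s = math.cos(a), math.sin(a)
            for ki, k in enumerate(scales):
                tx = x2 - k * (c * x - s * y)
                ty = y2 - k * (s * x + c * y)
                acc[(ai, ki, round(tx / t_step), round(ty / t_step))] += 1
    (ai, ki, txi, tyi), votes = acc.most_common(1)[0]
    return (angles[ai], scales[ki], txi * t_step, tyi * t_step), votes


# Six points on one moving object plus two unrelated displacement vectors.
true_angle, true_scale, true_t = 0.1, 1.1, (3.0, -2.0)


def move(p):
    c, s = math.cos(true_angle), math.sin(true_angle)
    return (true_scale * (c * p[0] - s * p[1]) + true_t[0],
            true_scale * (s * p[0] + c * p[1]) + true_t[1])


inliers = [(10, 0), (0, 15), (-20, 5), (12, -18), (-7, -9), (25, 25)]
vectors = [(p, move(p)) for p in inliers]
vectors += [((5, 5), (40, 40)), ((-10, 20), (-30, -5))]

params, votes = hough_2d_motion(vectors,
                                angles=[-0.2, -0.1, 0.0, 0.1, 0.2],
                                scales=[0.9, 1.0, 1.1])
print(params, votes)
```

Each vector casts one vote per (a, k) cell, that is, along a two-dimensional surface of the four-dimensional space, and only at the true parameters do the votes of all six object points coincide. With five parameters and one constraint per vector the votes would spread over a four-dimensional set, which is why the peaks become harder to find.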


II.  FEATURE  MATCHING  BY  HIERARCHICAL  CORRELATION 

Feature  matching  algorithms  are  important  in 
problems  involving  motion  detection,  image 
registration,  and  stereo  vision.  Hierarchical 
correlation  provides  a  computationally  efficient 
feature matching strategy. Such algorithms can be
implemented in hierarchical parallel hardware
architectures, and they can also be implemented on
a sequential machine to run very efficiently using
a coarse to fine matching strategy.

Glazer, Reynolds, and Anandan [4] have
developed a hierarchical matching algorithm that
consists of matching band-passed versions of the
images at different levels of resolution. The
filters approximate the convolution of a Laplacian
with a Gaussian (del-squared-G) of different sizes.
Alternative computational techniques for
implementing the band-pass filter are being
examined. One technique involves computing the
del-squared-G at the finest level followed by a 4x4
Gaussian centered on 2x2 windows to reduce the
resolution by a factor of two on each axis. These
algorithms are computed in the processing cone [8]
of the VISIONS Image Operating System [9].

The matching is performed first on the low
frequency structures occurring at the coarsest
levels of the images, thus providing a coarse to
fine strategy for matching higher frequency
information at the levels below. This reduces the
problem of false matches when, for example, there
is high frequency texture with somewhat repetitive
patterns. Thus all useful information of the
image is utilized at different levels: low
frequency information at coarser levels and higher
frequency information at finer levels.

The correlation strategy utilizes the
observation that at some sufficiently coarse level,
the maximum displacement of an image event between
a pair of images is at most one pixel. This
restricts the search at that level to a 3x3 area
and provides an estimate of displacement within +/-
1/2 pixel accuracy. The projection of this
estimate to the next finer level provides an
estimated displacement of +/- 1 pixel and allows
search to again be restricted to a 3x3 area, with
the process repeating downward. There are two
significant computational advantages of this
process. The number of correlation matches
considered is 9*logD instead of (2D+1)**2, where D
is the maximum displacement possible at the finest
level of resolution. In addition, an 8x8
correlation window was used at all levels; a window
of size (8D)**2 would be required to capture the
same amount of information in a single-level
search across correlation positions.
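
The coarse to fine projection step can be illustrated in one dimension. This sketch makes several simplifying assumptions that are not in the original: signals instead of images (so three candidates per level rather than nine), plain pairwise averaging in place of the del-squared-G band-pass pyramid, circular indexing, and a whole-signal squared-difference mismatch instead of 8x8 correlation windows.

```python
def reduce(signal):
    """Halve resolution by averaging neighbouring pairs (a stand-in for the band-pass pyramid)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def ssd(a, b, shift):
    """Mismatch between a and b displaced by `shift`, using circular indexing."""
    n = len(a)
    return sum((a[i] - b[(i + shift) % n]) ** 2 for i in range(n))


def coarse_to_fine_shift(a, b, levels):
    """Estimate the displacement of b relative to a, searching +/- 1 sample at every level."""
    pyramid = [(a, b)]
    for _ in range(levels):
        a, b = reduce(a), reduce(b)
        pyramid.append((a, b))
    estimate, evaluations = 0, 0
    for a, b in reversed(pyramid):                 # coarsest level first
        estimate *= 2                              # project the estimate to the finer level
        candidates = (estimate - 1, estimate, estimate + 1)
        estimate = min(candidates, key=lambda s: ssd(a, b, s))
        evaluations += len(candidates)
    return estimate, evaluations


n, true_shift = 64, 8
a = [max(0, 12 - abs(i - 24)) for i in range(n)]        # a single triangular bump
b = [a[(i - true_shift) % n] for i in range(n)]         # the same bump, displaced
shift, evaluations = coarse_to_fine_shift(a, b, levels=3)
print(shift, evaluations)
```

Here a displacement of 8 samples is found with 12 mismatch evaluations, where a single-level search over -8..8 would need 17. In two dimensions, taking the logarithm to base 2 because resolution halves at each level, D = 16 gives 9*logD = 36 candidate matches against (2D+1)**2 = 1089.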

The algorithm has been shown in practical
experiments to be effective in determining even
small amounts of rotation, seems to be insensitive
to noise, and of course is very efficient.
Experiments have shown that it may not be necessary
to restrict the algorithm to sets of
interesting points that have a high degree of
distinctiveness (such as corners). Some
experiments have shown consistently correct results
using all points, and thus the algorithm might work
on an arbitrary sampling of points.


III.  A  CONTENT  ADDRESSABLE  ARRAY 
PARALLEL  PROCESSOR  (CAAPP) 

Our  research  environment  has  maintained  a 
continuous  interest  in  parallel  architectures  and 
parallel  algorithms.  Real-time  motion  processing 
will  require  between  one  and  two  orders  of 
magnitude  more  computational  power  than  static 
vision.  Thus,  VLSI  technology  and  massively 
parallel  machines  are  obvious  research  directions. 

Weems,  Levitan,  and  Foster  [11]  have  developed 
a  design  for  a  Content  Addressable  Array  Parallel 
Processor  (CAAPP)  and  have  been  applying  it  to  the 
motion  algorithms  with  Lawton  [5].  The  CAAPP  is 
both  a  512x512  Single  Instruction  Multiple  Data 
(SIMD)  array  processor  and  an  associative  memory. 
The  design  is  based  on  a  64x64  array  of  custom  VLSI 
chips  and  is  intended  to  act  as  a  slave  processor 
for  a  general  purpose  computer  system.  Each  chip 
then  contains  64  cells,  an  instruction  decoder,  and 
some  miscellaneous  logic.  There  are  eight  basic 
instruction  types  recognized  by  the  chip,  each 
performed  in  parallel  by  the  constituent  cells. 
Most instructions take one minor cycle time (100
nanoseconds) to execute. Inter-cell communication
is bit serial and is accomplished by a four-way (N,
S, E, W) cell interconnect network, allowing for
three types of edge treatments: dead-edging,
circular wrap, and zig-zag wrap. The entire memory
may be bulk-loaded in one video frame time (1/30
second).
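
The array dimensions quoted above are mutually consistent, as a few lines of arithmetic confirm. The constants come from the text; the square 8x8 arrangement of cells on a chip is an assumption, since the text gives only the count of 64 cells per chip.

```python
import math

CHIPS_PER_SIDE = 64        # 64x64 array of custom VLSI chips (from the text)
CELLS_PER_CHIP = 64        # cells on each chip (from the text)
MINOR_CYCLE_NS = 100       # minor cycle time in nanoseconds (from the text)

total_cells = CHIPS_PER_SIDE ** 2 * CELLS_PER_CHIP
array_side = math.isqrt(total_cells)          # side of the square cell array
chip_side = math.isqrt(CELLS_PER_CHIP)        # assumed square layout of cells per chip

# Whole minor cycles that fit in one video frame time of 1/30 second.
cycles_per_frame = 10 ** 9 // (30 * MINOR_CYCLE_NS)

print(total_cells, array_side, chip_side, cycles_per_frame)
```

The 64x64 chips of 64 cells each give exactly the 512x512 cell array, and one frame time leaves room for roughly a third of a million single-cycle instructions.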

A  very  interesting  application  developed  for 
the  CAAPP  (that  makes  use  of  the  associativity  and 
array  processing  capabilities)  is  an  effective 
means  of  quickly  and  accurately  decomposing  a  flow 
field  into  its  rotational  and  translational 
components  to  recover  the  parameters  of  sensor 
motion.  The  algorithm  is  an  exhaustive  search 
procedure  via  a  top-down  parallel  correlation  of  a 
set  of  rotational  and  translational  flow  field 
templates  to  find  a  component  pair  which  most 
closely  accounts  for  the  motion  depicted  in  a  given 
flow  field.  Currently,  1000  rotational  templates 
and  200  translational  templates  are  used.  Each 
cell  contains  the  horizontal  and  vertical 
components  of  a  flow  vector,  each  specified  with  10 
bits  of  precision. 

Experiments  have  been  performed  with  a  CAAPP 
simulator  on  a  VAX  11/780  using  a  wide  variety  of 
motions  and  simulated  environments.  In  all  cases 
examined,  the  translational  template  closest  to  the 
actual  translational  motion  was  selected.  The 
rotational  template  was  always  close  to  the  actual 
rotational  motion,  but  was  sometimes  not  the 
closest  template.  The  procedure  proved  to  be 
resistant  to  limited  Gaussian  noise  as  well  as  to 
limited  random  spike  noise  in  the  original  flow 
field.  The  CAAPP  timing  calculations  revealed  that 
the  algorithm  could  perform  the 
rotational-translational  decomposition  in  slightly 
more  than  1/4  second.  Given  fabrication  techniques 
available  in  the  immediate  future,  execution  times 
can  be  expected  to  be  significantly  improved. 

Using  the  CAAPP  strictly  as  a  parallel  array 
processor  it  is  of  course  possible  to  perform 
standard  image  processing  operations  such  as 
convolution.  For  example,  a  simple  3x3  Gaussian 
mask  convolution  can  be  done  in  98  microseconds  on 
the  CAAPP.  It  should  be  noted  that  the  time 
required  to  perform  a  convolution  on  the  CAAPP  is 
constant  for  a  given  image  size  and  only  varies 
depending  on  the  size  and  complexity  of  the  mask. 
For  example,  a  10x10  mask  of  8  bit  multipliers 
applied  to  an  image  of  16  bit  pixels  (with  the  same 
number  of  pixels  as  the  previous  example)  would 
require  on  average  approximately  30  milliseconds 
(about  one  frame  time).  The  method  used  is  not 
restricted  to  square  masks  and  is  actually  easily 
adapted to such shapes as annuli and disjoint
areas.


IV. RULE BASED STRATEGIES FOR
IMAGE INTERPRETATION


As part of the VISIONS long-term project for
the interpretation of static images, we have
developed an experimental testbed for examining
issues in knowledge directed processing. Weymouth,
Griffith, Hanson and Riseman [6] have been
developing a rule based image interpretation system
which has been effective in interpreting a set of
complex outdoor scenes. The system utilizes world
knowledge in the form of simple object hypothesis
rules, and more complex interpretation strategies
attached to object and scene schema, to reduce the
ambiguities in image measurements.

Descriptions  of  scenes,  at  various  levels  of 
detail, are stored in a set of schema hierarchies
[12].  A  schema  graph  is  a  data  structure  defining 
an  expected  collection  of  objects,  such  as  a  house 
scene,  the  expected  visual  attributes  associated 
with  the  objects  in  the  schema  (each  of  which  can 
have  an  associated  schema),  and  the  expected 
relations  among  them.  This  stored  knowledge  can  be 
used  to  infer  the  presence  and  location  of  other 
objects,  or  verify  uncertain  hypotheses  via  spatial 
consistency  of  object  labels.  However,  in  order  to 
use  this  knowledge  there  must  be  a  basis  for 
partial  interpretations. 

In  the  initial  stages,  there  are  few  if  any 
image hypotheses, and development of a partial
interpretation  must  rely  primarily  on  general 
knowledge  of  expected  object  characteristics  that 
are  independent  of  other  hypotheses.  We  propose  an 
approach  to  object  hypothesis  formation  which  is 
both  simple  and  effective.  It  relies  on  convergent 
evidence  from  a  variety  of  measurements  and 
expectations.  The  rules  involve  sets  of  partially 
redundant  features  each  of  which  defines  an  area  of 
feature  space  which  represents  a  "vote"  for  an 
object.  The  features  include  color,  texture, 
shape,  size,  image  location,  and  relative  location 
to  other  objects.  For  example,  in  an  outdoor  scene 
taken  with  a  camera  in  standard  position,  one  would 
expect  grass  to  be  of  medium  brightness,  to  have  a 
significant  green  component,  to  embody  a  modest 
degree  of  texture,  to  be  located  somewhere  in  the 
lower  portion  of  the  image,  etc.  These 
expectations  are  translated  into  a  rule  which 
combines  the  results  of  many  measurements  into  a 
confidence  level  that  the  region  (or  group  of 
regions)  represents  grass. 
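
A rule of this kind can be sketched as a weighted combination of feature votes. The sketch is illustrative only: the feature names, the numeric ranges, the weights, and the trapezoidal vote function are all hypothetical, and the combination rules actually used in the VISIONS system are not specified here.

```python
def vote(value, lo, hi, margin):
    """1.0 inside [lo, hi], falling linearly to 0.0 at `margin` outside the interval."""
    if lo <= value <= hi:
        return 1.0
    dist = lo - value if value < lo else value - hi
    return max(0.0, 1.0 - dist / margin)


# (feature, lo, hi, margin, weight); numbers are illustrative, on a 0..1 scale.
GRASS_RULE = [
    ("brightness", 0.3, 0.7, 0.2, 1.0),         # medium brightness
    ("green_ratio", 0.4, 1.0, 0.1, 2.0),        # significant green component
    ("texture", 0.2, 0.6, 0.2, 1.0),            # modest degree of texture
    ("vertical_position", 0.5, 1.0, 0.2, 1.0),  # 0 = top of image, 1 = bottom
]


def confidence(region, rule):
    """Weighted average of the feature votes: convergent evidence for one object label."""
    total_weight = sum(w for *_, w in rule)
    return sum(w * vote(region[f], lo, hi, m) for f, lo, hi, m, w in rule) / total_weight


lawn = {"brightness": 0.5, "green_ratio": 0.55, "texture": 0.4, "vertical_position": 0.8}
hedge = {"brightness": 0.5, "green_ratio": 0.55, "texture": 0.9, "vertical_position": 0.8}
sky = {"brightness": 0.95, "green_ratio": 0.2, "texture": 0.0, "vertical_position": 0.1}

for name, region in (("lawn", lawn), ("hedge", hedge), ("sky", sky)):
    print(name, confidence(region, GRASS_RULE))
```

The heavily textured region still scores well because the other features agree, which is the point of using partially redundant features: no single measurement has to be reliable.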

Convergent  evidence  from  multiple 
interpretation  strategies  is  organized  by  top-down 
control  mechanisms  in  the  context  of  a  partial 
interpretation.  The  extreme  variations  that  occur 
across  images  can  be  compensated  for  somewhat  by 
utilizing  an  adaptive  strategy.  This  approach  is 
based  on  the  observation  that  the  variation  in  the 
appearance  of  objects  (region  feature  measures 
across  images)  is  much  greater  than  object 
variations within an image. One such strategy
extends a kernel interpretation derived through the
selection of object exemplars, which are regions
that represent the most reliable image specific
hypotheses of a general object class. The use of
exemplar strategies and other top-down strategies
results in the extension of partial interpretations
from islands of reliability. Finally, a
verification phase can be applied where relations
between object hypotheses are examined for
consistency. Thus, the interpretation is extended
through matching and processing of region
characteristics as well as semantic inference.

Experiments  are  being  conducted  on  a  set  of 
fifteen  "house  scene"  images.  Thus  far,  we  have 
been  able  to  extract  sky,  grass,  and  foliage  (often 
separating  trees  and  bushes)  from  nine  house  images 
with  reasonable  effectiveness,  and  have  been 
successful  in  identifying  houses  and  their  parts, 
including  shutters  (or  windows),  house  wall  and 
roof  in  three  of  these  images.  The  interpretation 
strategies  use  many  redundant  features,  each  of 
which  can  very  often  be  expected  to  be  present. 
The  premise  is  that  many  redundant  features  allow 
any  single  feature  to  be  unreliable.  The  features 
utilized  include  those  mentioned  earlier  (color  and 
texture  attributes,  shape,  size,  location  in  the 
image),  as  well  as  relative  location  to  identified 
objects,  and  similarity  in  color  and  texture  to 
identified  objects.  Object  hypothesis  rules  were 
employed  as  described  in  previous  sections,  and 
additional  object  verification  rules  requiring 
consistent  relationships  with  other  object  labels 
are being developed.


1. Robust Vision Operators

1.1 Parameter Networks and the Hough Transform

One of the most difficult problems in vision is
segmentation. Recent work has shown how to calculate
intrinsic images (e.g., optical flow, surface orientation,
occluding contour, and disparity). These images are
distinctly easier to segment than the original intensity
images. Such techniques can be greatly improved by
incorporating Hough methods. The Hough transform idea
has been developed into a general control technique.
Intrinsic image points are mapped (many to one) into
'parameter networks' [Ballard, 1983]. This theory explains
segmentation in terms of highly parallel cooperative
computation among intrinsic images and a set of
parameter spaces at different levels of abstraction.

The most recent applications of these ideas are to
improved shape-from-shading calculations which work on
several spaces [Brown et al., 1983] and motion extraction
[Ballard & Kimball, 1983]. This domain specific effort is
closely linked to our new work on a more general theory
of Hough-like computations and general implementation
techniques for them.

The theory is also useful in analysis of cache-based
Hough transform implementations. It is an appealing idea
to use a small content-addressable store to accumulate
Hough transform results, rather than a potentially huge
multi-dimensional array. The initial technical issues were
discussed in [Brown & Sher, 1982]. We are currently
pursuing VLSI implementations.

1.2 Hough Transform Implementation

Earlier work on the Hough transform [Brown, 1983;
Brown & Sher, 1982] has led in three directions.

1) Research toward a theory of cache accumulator
arrays [Loui, 1983; Brown & Feldman, 1983]

2) Experiments with complementary HT and
cache management strategies [Brown et al., 1983]

3) Hardware (VLSI) designs for HT vote caches
[Sher & Tevanian, 1983].


Work in each of these directions is in progress; some
of the cited references are draft documents. The behavior
of caching schemes for accumulation of 'votes' in the
Hough transform is equivalent to the statistical problem of
estimating the mode of a distribution using only a finite
memory for vote tallies, and is a generalization of the
familiar 'secretary' ('maximum of a sequence,' 'beauty
contest') problem. Loui's document explores this avenue
for analysis. The experiments with HT implementation are
to see how well the peak-sharpening provided by
complementary HT performs with real images on complex
shapes. Work on cache architectures (hierarchical schemes,
cascaded caches) is ongoing.
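
To make the finite-memory mode estimation problem concrete, the sketch below uses one standard technique, the Misra-Gries frequent-items summary, swapped in purely for illustration. It is not the caching or replacement policy of the cited drafts, which are not described here, and the function name and vote stream are hypothetical.

```python
def cached_hough_peak(votes, cache_size):
    """Estimate the most-voted parameter cell using at most `cache_size` tallies."""
    cache = {}
    for cell in votes:
        if cell in cache:
            cache[cell] += 1
        elif len(cache) < cache_size:
            cache[cell] = 1
        else:
            # Cache full and the cell is new: age every tally, dropping those that reach zero.
            for key in list(cache):
                cache[key] -= 1
                if cache[key] == 0:
                    del cache[key]
    return max(cache, key=cache.get) if cache else None


# 20 votes: 9 for the true peak cell (3, 7), 11 spread thinly over six noise cells.
noise = [(0, 1), (5, 2), (8, 8), (1, 4), (6, 0), (2, 9)]
stream = []
for i in range(11):
    stream.append(noise[i % len(noise)])
    if i < 9:
        stream.append((3, 7))

peak = cached_hough_peak(stream, cache_size=3)
print(peak)
```

With three tallies, any cell that receives more than a quarter of all votes is guaranteed to survive in the cache, so a sufficiently dominant peak is found without allocating the full accumulator array. Weak peaks can be lost, which is the kind of trade-off a theory of cache accumulators has to quantify.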

The VLSI design project produced a circuit for vote
caching that can be cascaded to provide a cache of any
length. Work on improving the efficiency and power of
the design will continue this summer.

1.3  High  Level  Planning 

In general, problem solvers cannot hope to create plans
that are able to specify fully all the details of operation
beforehand and must depend on run time modification of
the plan to ensure correct functioning. The run time
planning idea becomes particularly important when
different plan segments are being explored concurrently.
These communicating segments may require sophisticated
actions, e.g. (do PLANx until PLANy). These issues are
being studied by [Russell] in the context of a cooperative
planning and execution system for manipulation tasks.

2.  Computing  with  Connections 

We  are  continuing  our  interest  in  problem-scale 
parallelism,  both  as  a  model  of  animal  brains  and  as  a 
paradigm for VLSI. Work at Rochester has concentrated
on  connectionist  models  and  their  application  to  vision. 

The framework is built around computational modules,
the simplest of which are termed p-units. We have
developed their properties and shown how they can be
applied to a variety of problems [Feldman & Ballard,
1982]. More recently, we have established powerful
techniques for adaptation and change in these networks
[Feldman, 1982].

A major milestone was achieved with Sabbah's thesis
on massively parallel recognition of Origami world objects
[Sabbah, 1982]. Sabbah's work extended the connectionist
methodology to a problem domain with several
hierarchical structural levels. The resulting program is, to
our knowledge, the most noise resistant system for dealing
with this level of complexity. One outcome of Sabbah's
effort has been a project to build a general purpose
simulator for massively parallel systems [Small et al.,
1982].

The general connectionist simulator has been well
tested and is being used in a number of applications. One
project involves a quite detailed simulation of motor
control networks of the oculomotor system [Addanki,
1983]. Another application is to a spreading activation
model of word sense disambiguation and related problems
in natural language understanding [Cottrell & Small,
1983]. A major new effort involves modelling conceptual
knowledge (such as that needed for high level vision) in
connectionist terms. The simulator has also been a starting
point for some of our efforts towards VLSI realization of
connectionist machines (Section 5).

For a VLSI design course, a circuit was designed to
implement key aspects of the "connectionist"
computational paradigm [Rainero & Kantz, 1983]. This
cited document is a course project report, and the exercise
was mainly useful in isolating particular technical
problems that must be addressed in any such parallel,
activation passing computer.

3.  Motion 

Our interest in motion has centered around methods
for extracting rigid body parameters from optic flow and
intensity images. These parameters are extremely useful in
navigation and target tracking. Currently these nine
parameters (origin, translational velocity, rotational
velocity) can be extracted from flow via a Hough
technique [Ballard & Kimball, 1983]. We are also pursuing
the use of these parameters to speed up the flow
computations themselves [Stuth et al., 1983].

4.  Shape 

The description and recognition of complex shapes
continues to be a major focus of the project. The analysis
of the dot product space representation has been improved
to handle certain pathological cases, and has been
generalized to accommodate different criteria for the
goodness of the representation.

This simple concept of shape has been applied to the
problem of reconstructing three-dimensional surfaces from
very sparse data. The key idea is to use appropriate shape
descriptors to hypothesize a transformation which accounts
for the difference in shape between successive contours.
When the hypothesized transformation is minor, very
simple-minded surface reconstruction techniques are
sufficient. When there are major differences in shape or
position between successive contours, our method
hallucinates new contours, using the hypothesized shape
transformation [Sloan & Hrechanyk, 1981].

Hierarchical descriptions of shapes were considered in
[Ballard & Sabbah, 1981] in a preliminary fashion. Our
previously reported shape model [Hrechanyk & Ballard,
1982] concentrated on problems of view invariance and
attention shifting within a single prototype. This model has
been extended to handle the problems of extracting
primitive shape descriptions from noisy images. Our work
was motivated by dissatisfactions with smoothness criteria
for intrinsic image computations. Our current model uses
correspondences called view frames as primitives. This
model allows gestalt grouping to be modelled as well as
parallel search for prototypes and parameter tracking
[Hrechanyk & Ballard, (this Proceeding)].

The practicality of shape from shading computations
and their interaction with the determination of other
image parameters (such as illuminant position) was
addressed by two papers in the Fall 1982 DARPA Image
Understanding Workshop. Since then we have
implemented a multi-resolution shape from shading
algorithm that exhibits high efficiency and accuracy in
surface reconstruction of large (128 x 128) irregular shapes
(Figure 1). We are now applying the algorithm to real
images, and want to investigate scenes with non
Lambertian reflectance functions that are unknown
a priori. We want to explain how humans in fact use
shading to derive shape, given the complexity of
reflectance functions and imaging situations in the world.
Two competing theories are that somehow the reflectance
functions are derived fairly accurately by an adaptive
procedure, or instead that we only support a small
number of reflectance functions that are selected by other
cues (such as gloss).

5.  General  Theory  of  Vision 

Work in our laboratory, among others, has
demonstrated strong links between powerful IU
techniques and computations used by animal visual
systems. We have established strong ties with a wide range
of visual scientists at Rochester and a variety of
collaborative efforts are underway. One early project is to
survey the computational similarities in natural and
computer vision [Ballard & Coleman, 1983].

We have begun to exploit Rochester neurobiology
expertise in order to hone and improve our connectionist
modelling efforts. One difficult avenue is to specify the
interface between our computational models and the
state-of-the-art neurobiological picture. Our efforts in this
direction are summarized in [Ballard & Coleman, 1983]
and the collaboration is continuing. Another effort is our
attempt to develop a general framework for theories of
vision that would provide a common structure for
integrating studies from various disciplines [Feldman,
1982].
Figure 1: Surface Reconstruction of Large Irregular Shapes

I. INTRODUCTION

This  paper  summarizes  our  major 
research  activities  for  the  period  of 
October  1982  to  April  1983.  More  details 
can  be  found  in  other  technical  papers  in 
the  proceedings  of  this  workshop  [1-2]  and 
the  proceedings  of  the  computer  vision  and 
pattern  recognition  conference  being  held 
concurrently  with  this  workshop  [3-5].  Our 
main  focus  has  continued  to  be  on 
developing  a  high-level  symbolic  matching 
system  that  would  be  useful  for  the  tasks 
of  map-updating,  autonomous  navigation  and 
object  recognition.  We  have  largely  used 
aerial  images  for  testing,  but  the 
techniques  should  also  apply  to  other 
domains.  We  have  also  been  working  on 
generating  better  descriptions,  including 
improved  segmentation,  shadow  analysis  and 
stereo.  We  have  also  continued  work  with 
Hughes  Research  Laboratories  on  hardware 
implementation  of  IU  algorithms. 


II.  SYMBOLIC  MATCHING 

Our  recent  work  in  this  area  has  been 
primarily  in  extensive  testing  and 
evaluation  of  our  previously  reported 
matching  methods  [6-7].  We  have  compared 
our  relaxation  matching  scheme  to  a  variety 
of  others,  using  different  convergence  and 
confidence updating criteria. These tests
indicate that the criterion optimization method
is superior in terms of the number of
iterations needed and in the accuracy of
the results.

We  have  applied  our  line  matching 
technique  to  the  inspection  of  printed 
circuit  boards  for  missing  or  improperly 
inserted  parts.  The  system  developed  for 
aerial  images  required  only  small 
modifications  to  incorporate  a  more  complex 
model  representation;  these  efforts  are 
described  in  [3]. 


III.  STEREO  MATCHING 

Conventional  stereo  matching  uses 
correlation  of  intensities  in  some  form,  in 
selected  windows  of  two  images.  Some  of 
the  more  modern  approaches,  e.g.  [8,9], 
match  edges,  which  are  likely  to  be  more 
invariant.  We  have  recently  started 
experimenting  with  matching  of  line 
segments.  Initially,  we  attempted  to  apply 
our  above  cited  line  matching  method; 
however,  the  distortions  inherent  in  stereo 
images  led  us  to  a  different  matching 
criterion. Essentially, our system finds
the set of matches that gives minimum
"differential disparity", i.e. the flattest
consistent interpretation. This system
needs further development, but the initial
results are very promising; this work is
described in detail in [1].


IV.  SHADOW  ANALYSIS 

We  have  been  working  on  using  shadows 
to  extract  heights  of  structures;  our  work 
on  extracting  heights  of  buildings  by  using 
a  priori  knowledge  of  their  shapes  has  been 
reported previously [10]. We have
generalized this work for other objects,
e.g. oil tanks, by relying strongly on the
known direction of illumination to make
correspondence between object boundaries
and their shadows. Some of our new work is
described  in  [4]. 


V.  SEGMENTATION 

We  have  developed  a  new  texture 
segmentation  system  that  uses  relatively 
simple measures of texture uniformity. The
segmentation is hierarchical: a low
resolution segmentation is used to compute
a more accurate segmentation at a higher
resolution level. The method is described
in  [2]. 
Abstract

This paper demonstrates how image features of
linear extent (lengths and spacings) generate nearly
image-independent constraints on underlying surface
orientations. General constraints are derived from the
shape-from-texture paradigm; then, certain special cases
are shown to be especially useful. Under orthography, the
assumption that two extents are equal is shown to be
identical to the assumption that an image angle is a right
angle (i.e. orthographic extent is a form of slope or
skewed symmetry). Under perspective, if image extents
are assumed equal and parallel, extent again degenerates
into slope. In the general perspective case, the shape
constraints are usually complex fourth-order equations,
but they often simplify, even to graphic constructions in
the image space itself. If image extents are colinear and
assumed equal, the constraint equations reduce to second
order, with several graphic analogs. If extents are
adjacent as well, the equations are first order and the
derived construction (the "jack-knife method") is
particularly straightforward and general. This method
works not only on measures of extent per texel, but also
on reciprocal measures: texels per extent. Several
examples and discussion indicate that the methods are
robust, deriving surface information cheaply, without
search, where other methods must fail.*

1  Introduction 

In this paper, we show how certain simple aggregate
image properties involving spatial extent along one
dimension can be used as cues for determining underlying
three-dimensional surface orientation. Image-measurable
properties such as lengths and spacings are shown to
generate constraints on local surface slope in a nearly
image-independent way. The derivation of these
relationships is identical in analytic method ("shape from
texture") and representational structure (the gradient
space) to those derived for other imaging phenomena such
as skewed symmetry or image slope. Thus, they provide
additional surface information in a form (either equation
or graph) that is easily integrable with that of other
existing algorithms.

Linear extents are measurements along a straight
image line of either objects (in which case they are
lengths) or virtual objects (in which case they are
spacings). The exact form of the input to these analyses
can vary. A prior edge detection and linking step, or a
segmentation-like step, is assumed. Lengths are then
linear measures of image tokens such as elongated blobs,
and spacings are linear measures of the virtual lines
between image tokens. Spacing behaves the same way as
length does; often it is more conveniently available.


* This research was sponsored in part by the Defense
Advanced Research Projects Agency under contract
N000;i0-82-C -0127.


In general, this paper follows the image
understanding conventions presented in, among other
places, [Kender 80a]. That is, the image coordinate
system considers the z axis to be positive in the direction
of view; the image itself to be plane z=1, which has been
rotated in front of the lens at the origin; and the unit of
length in the system to equal the focal length of the lens.
Surfaces in the scene are locally represented by planar
patches, and the surface gradient of the patch
z=px+qy+c is represented by the point (p,q), its
gradient, in the gradient space.

The problem of deriving surface information from
textural and regularity assumptions occurs in two steps.
First, the textural element (in this case an image extent)
is backprojected onto all possible surface patches. A map
of the scenic measure of the component is recorded. The
recovered scene extents are usually a function of the
image extent's position and the surface's parameters. In
the second step, two or more nearby textural elements are
assumed to be equal in measure in the scene.
Mathematically this means that the maps can be
intersected to find those surface patch parameters that
generate for each texel the same measure (that is, the
same texture). Because the gradient space is coupled to
the image space (a rotation in one induces an equal
rotation in the other), the problem of backprojecting
textural elements can often be simplified by factoring
out rotation. That is, the camera roll component does not
affect the depth and surface information of the image in
any significant way.

2  Extents  under  Orthography 

For the case of spatial extent under orthography,
the rotational coupling reduces it to the problem of
back-projecting a single horizontal extent between the
points (a,y) and (b,y), where L=(b-a) is the image extent
(see Figure 1). Further, the Jacobian of the deprojection
mapping of image space onto the surface space is
constant. (It is equal to ||N||, the norm of the surface
normal N=(p,q,-1).) Thus all induced surface extents are
preserved under translations of their sources in the image,
and any candidate image extent can be translated to the
origin. The problem then reduces to that of
backprojecting the line from (0,0) to (L,0): a problem with
one free parameter. Such simplifications characterize
orthographic projection in general, but the resulting
texture maps are often weak in analytic power, as the
following discussion shows.

Backprojecting the image point (x,y,1) onto the
plane with equation z=px+qy+c is achieved by the
transformation: (x,y,1) becomes (x,y,px+qy+c). Without
bothering to set up a detailed scene coordinate system, it
is easy to see that the two points backproject to (0,0,c)
and (L,0,pL+c), respectively. The scene extent is
therefore L sqrt(1+p^2), which is a function of p and q:
this is the normalized texture property map.
Figure 1: Back projecting an extent under orthography.

Note that the q component does not affect the scene
extent. This is because q measures departure from
"verticality", and a horizontal line is only affected in
backprojection by that component which alters "left-right"
slant (the p component). When graphed in the
gradient space, this map has the property that the
normalized texture property of extent is a hyperboloid of
one sheet, with a minimum value of L on the q axis.
Therefore, backprojection under orthography never
decreases measures of extent.
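
The orthographic map above can be checked numerically. The following is a minimal sketch, not taken from the original paper: the function names are hypothetical, and it simply backprojects the two endpoints of a horizontal image extent onto the plane z=px+qy+c and compares the resulting scene length with the closed form L sqrt(1+p^2).

```python
import math

def backproject_orthographic(x, y, p, q, c):
    """Map image point (x, y) onto the plane z = p*x + q*y + c (orthography)."""
    return (x, y, p * x + q * y + c)

def scene_extent_orthographic(a, b, y, p, q, c=0.0):
    """Scene length of the horizontal image extent from (a, y) to (b, y)."""
    end_a = backproject_orthographic(a, y, p, q, c)
    end_b = backproject_orthographic(b, y, p, q, c)
    return math.dist(end_a, end_b)

def ntpm_orthographic(L, p):
    """Closed form from the text: L * sqrt(1 + p^2); q does not appear."""
    return L * math.sqrt(1.0 + p * p)
```

Because the closed form depends only on L and p, translating the extent in the image or changing q or c leaves the value unchanged, which is why two parallel extents give no positive constraint under orthography.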

To use this normalized texture property map
(NTPM), consider an image with two extents in it.
Suppose they arise from parallel extents in the scene; this
parallelism is carried over into the image. But under
orthography, either extent can be translated into
superimposition on the other. Thus, if they are of equal
extent, they (and their NTPMs) exactly coincide and no
further information about the scene is obtained. If they
are unequal in extent, they will superimpose with unequal
overlap; since the coupling of the gradient space maps
also causes the NTPMs to superimpose, there is no
solution. That is, if L1 and L2 differ, L1 sqrt(1+p^2) never
equals L2 sqrt(1+p^2), and there is no surface patch that
can support two equal, parallel scene extents if there are
unequal image extents. Thus parallelism of equal extents
under orthography provides no information about
surfaces, except in this weak, negative fashion.


Now suppose the extents arise from non-parallel,
but equal extents in the scene. This situation is more
interesting: the image extents can be translated so that a
pair of their ends will meet and form an angle. Because
the image extents are non-parallel in the image as well,
their NTPMs will also have been rotated by different
amounts. Further, their image measures are, in general,
unequal, so the NTPM intersection is non-trivial. The
resulting constraint equation is a messy one in terms of
L1, L2, their joint angle, and second powers of p and
q. However, it is not difficult to prove that the constraint
on surface orientation that it induces can be graphed as a
hyperbola in the gradient space. The following
construction shows that the hyperbola is the Kanade
hyperbola [Kender 80b], which usually arises under the
assumption that a given image angle is caused by a scene
right angle.

Consider Figure 2, in which the two image extents
forming their angle have been closed off with the addition
of a line. It is well known that orthography preserves
midpoints of lines; thus the image figure, with yet another
line connecting the vertex to this midpoint, can be seen as
a scene isosceles triangle in perspective. Given this, the
angle formed by the altitude to the base in the scene must
be a right angle: this is the Kanade assumption. The
surface constraint then is identically derived.


Figure 2: Equal extent is skewed symmetry.

A special case occurs when image extents are
themselves equal, as with the corner of a square or
rhombus. The altitude-to-base angle constructed in the
image is found to be a true right angle as well. The
second order constraint equation now degenerates to a
linear one. Its graph in the gradient space is represented
by two perpendicular lines through the origin; one of
them is parallel to the triangle's base. These constrained
surface orientations have an easy interpretation: the
underlying surface could have pivoted about the altitude,
foreshortening both halves of the triangle equally, or the
surface could have pivoted about the base, foreshortening
the entire triangle, but without skew. Note that if the
image is assumed to be that of a square corner, it can be
analyzed solely in terms of slope phenomena. The scene
then contains two right angles: the corner one and the
induced one. (The two Kanade hyperbolae intersect in
the gradient space, giving a Necker pair of orientations.)

Equal extents in the image under orthography
therefore either give trivial results, or reduce to already
known cases of image slope and angle. (This is even true
for some other textural configurations for extents which
are not covered here.)


3  Extents  under  Perspective 

The analysis of extents under central perspective is
more complex, but it yields more powerful algorithms for
image understanding. Under perspective, the gradient
space remains coupled to the image space, still saving one
degree of freedom in analysis (the camera roll
component). However, the backprojection function is
more elaborate. In particular, the image point (x,y,1) is
taken onto the surface z=px+qy+c by
(-c/(1-px-qy))(x,y,1). This mapping has a non-linear Jacobian;
therefore, the mapping of image extents into scene
extents is critically affected by translations in the image.
This implies that the general backprojection of an image
extent of measure L must be from (a,y,1) to (b,y,1), where
L=(b-a), since no simplifying translations are possible.
Thus, there are three free parameters. The two points
are taken into (-c/(1-pa-qy))(a,y,1) and
(-c/(1-pb-qy))(b,y,1), respectively.

The induced surface extent is calculated in the scene
by the usual Euclidean metric, yielding a complex NTPM:

L(1/(1-pa-qy))(1/(1-pb-qy))sqrt((1-qy)^2+p^2(y^2+1)).

(The calculation here, as in the orthographic case, is
somewhat like a finite difference approximation to a
derivative. However, under perspective, it is exactly the
finite difference that is needed, since what is important is
the departures from linearity.) For ease of reference, this
NTPM will be abbreviated to

L(1/(1-pa-qy))(1/(1-pb-qy))S(p,y)

Notice that the function S(p,y), which also involves q
through the term (1-qy), is independent of both a and b.
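
The perspective NTPM can be checked against direct backprojection. The sketch below is illustrative only: the function names are hypothetical, and the closed form is written with (1-qy)^2 under the square root, which is what differencing the two backprojected endpoints gives. The scale factor -c/(1-px-qy) follows the text; its sign does not affect the length, and the normalized map drops the constant factor |c|.

```python
import math

def backproject_perspective(x, y, p, q, c):
    """Scale the image point (x, y, 1) by -c/(1 - p*x - q*y), as in the text."""
    k = -c / (1.0 - p * x - q * y)
    return (k * x, k * y, k)

def scene_extent_perspective(a, b, y, p, q, c):
    """Euclidean distance between the backprojected endpoints (a, y), (b, y)."""
    return math.dist(backproject_perspective(a, y, p, q, c),
                     backproject_perspective(b, y, p, q, c))

def ntpm_perspective(a, b, y, p, q):
    """Normalized closed form L * S / ((1-pa-qy)(1-pb-qy)).

    Assumes both denominators are positive, i.e. both endpoints lie on the
    same side of the vanishing line of the surface.
    """
    L = b - a
    S = math.sqrt((1.0 - q * y) ** 2 + p * p * (y * y + 1.0))
    return L * S / ((1.0 - p * a - q * y) * (1.0 - p * b - q * y))
```

Unlike the orthographic case, the value changes when the extent is translated along the image line, since a and b enter through the two denominators.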

Theoretically (or even practically), this function is
usable in its raw form. That is, given two extents in an
image under central perspective, it is possible to generate
the appropriate NTPMs for both (subject to their position
and orientation), and to intersect their graphs, as if they
were Hough accumulator arrays. The result would be a
small set of surface orientations which would
simultaneously normalize the two induced surface extents
to equal measure. However, in nearly all cases, this
involves the solutions to constraint equations that are of
fourth order in p and q. Only a few image configurations
generate simple surface constraints. (Some configurations
that one might expect would reduce the complexity do
not: for example, image extents that are radial with
respect to the image origin.) The ones that do simplify
have the added benefit that they appear to be relatively
common.

3.1  Equal  and  Parallel 

First assume that image extents arose from scene
components that were not only equal in measure, but
were parallel on the scene surface. A simple construction
(see Figure 3) shows that once again the image
configuration can be handled solely by considerations of
image slope. Two equal and parallel scene lines form a
parallelogram; in the image, their pairs of sides can be
extended to derive two vanishing points. Each vanishing
point implies a linear constraint in the gradient space: if
an image point (x,y) is a vanishing point of a surface, then
the surface must have a gradient (p,q) which satisfies
px+qy=1 ([Shafer 83]). Two such linear constraints
uniquely define a vanishing line, which in turn uniquely
defines the surface orientation.
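
A minimal sketch of this construction follows, assuming the four image corners of the parallelogram are already known. The helper names are hypothetical and are not from the original paper. Opposite sides are intersected (using homogeneous coordinates) to obtain the two vanishing points, and the two linear constraints px+qy=1 are then solved for (p,q).

```python
def cross(u, v):
    """Cross product of two 3-vectors (used for homogeneous lines/points)."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def line_through(a, b):
    """Homogeneous line through image points a = (x, y) and b = (x, y)."""
    return cross((a[0], a[1], 1.0), (b[0], b[1], 1.0))

def intersect(l1, l2):
    """Intersection of two homogeneous image lines."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        raise ValueError("sides are parallel in the image: vanishing point at infinity")
    return (x / w, y / w)

def gradient_from_parallelogram(A, B, C, D):
    """Corners in order A, B, C, D; sides AB, DC and AD, BC are scene-parallel."""
    x1, y1 = intersect(line_through(A, B), line_through(D, C))
    x2, y2 = intersect(line_through(A, D), line_through(B, C))
    # Solve p*x1 + q*y1 = 1 and p*x2 + q*y2 = 1 by Cramer's rule.
    det = x1 * y2 - x2 * y1
    if abs(det) < 1e-12:
        raise ValueError("vanishing points are colinear with the image origin")
    return ((y2 - y1) / det, (x1 - x2) / det)
```

The two error cases correspond to configurations in which the vanishing line is not determined by finite intersections; they would need separate handling in any real use.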

3.2  Equal  and  Colinear 

Assume now that the image extents did not arise
from parallel scene extents. There seems to be only one
other simplifying set of cases: those when the scene
components are colinear. Although these cases also
generate vanishing points, interestingly, they do not
reduce the problem again to one of image slopes. Nothing
can: colinear extents have only one slope in common.

The images of colinear scene components are also
colinear. The reverse is not true, though the heuristic
positing of that truth often is most useful. It would be
yet another preference heuristic, similar to those used in
other contexts in image understanding: for example,
nearby image pixels arise from actual scene patch
neighbors (shape from shading), nearly right angles arise
from scene right angles (skewed symmetry), near-parallels
arise from parallels (one form of shape from texture), etc.


Figure  3:  Equal  parallels  are  equivalent  to  slope. 

The image configuration in the most general case
reduces to the following. Four points lie on the horizontal
image line at height y (they are A=(a,y), with B, C, and
D defined similarly). These four points define two image
extents, L=(b-a) and R=(d-c), respectively. The
assumption of colinearity allows the NTPMs of the extents
to be put into correspondence easily: they are already in
the proper orientation, due to the one shared image slope.
Since they also share identical terms in S(p,y), equating
the NTPMs yields a surface constraint that reduces to
second order in p and q:

(1-pa-qy)(1-pb-qy)/L = (1-pc-qy)(1-pd-qy)/R

Although this equation can be exactly solved, it has
a simplifying graphic construction that can be drawn in
the image space itself, directly yielding the vanishing
point(s). Rewrite it in the following form:

(X-a)(X-b)/L = (X-c)(X-d)/R, where X = (1-qy)/p

If X satisfies the constraint equation, then scene
extents are equal, as desired. Further, this is a very
desirable X: it also satisfies the formal definition that
pX+qy=1, that is, the point (X,y) is a vanishing point.
Note that (X,y) lies on the line of colinearity; all that
must be calculated is the value of X itself. Formally, the
equation is of the form of the intersection of two
parabolae. The left parabola has value 0 at both a and b,
and a minimum value of -L/4 midway between them. The
right parabola is exactly of the same shape, except for
scaling (its midpoint minimum is -R/4). Thus, the value of
X can be graphically determined by drawing the
parabolae on the image, and finding their intersection.
(Notice that the mathematics, as well as the construction,
finds a vanishing point between b and c, where the image
lengths are on opposite sides of any vanishing line.)
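
The intersection of the two parabolae can also be computed directly. The sketch below is an illustration, not the paper's procedure: the function name is hypothetical, and it simply cross-multiplies the constraint to R(X-a)(X-b) = L(X-c)(X-d) and solves the resulting quadratic in X.

```python
import math

def colinear_vanishing_candidates(a, b, c, d):
    """Candidate X for image extents [a, b] and [c, d] on one image line.

    Solves R*(X-a)*(X-b) = L*(X-c)*(X-d) with L = b - a and R = d - c.
    """
    L, R = b - a, d - c
    A = R - L
    B = L * (c + d) - R * (a + b)
    C = R * a * b - L * c * d
    if abs(A) < 1e-12:
        # Equal image extents: the quadratic term vanishes, leaving one
        # finite root (the other vanishing point is at infinity).
        return [] if abs(B) < 1e-12 else [-C / B]
    disc = B * B - 4.0 * A * C
    if disc < 0.0:
        return []
    root = math.sqrt(disc)
    return sorted([(-B - root) / (2.0 * A), (-B + root) / (2.0 * A)])
```

In general two candidates are returned: one lies between b and c, matching the remark above about extents on opposite sides of a vanishing line, and the other lies outside both extents. Each candidate X gives the linear gradient constraint pX+qy=1.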


The parabola method can be refined in the following
way. Note first that it is not necessary to draw the
parabolae on the x axis; they can be translated upwards
to the horizontal line of colinearity itself. Secondly, the
parabolae are only constrained to pass through the point
pairs; their exact shape is not critical, as long as the
parabolae are similar (i.e. they can be mutually scaled).
Third, since the value of X is a purely formal one, the
parabolae can be imagined to be drawn out of the image
plane: that is, either parabola can be thought of as
extending into the -z axis direction.

More appropriately, the value of X on either
parabola can be considered as an image feature in its own
right. The calculation is really a type of local feature
assignment, with each position on the line of
colinearity being assigned two simultaneous features.
That position where the features are identical is the
vanishing point.

Parabolae grow very quickly, however, away from
their roots. This can be compensated for formally by
taking the square root of this image feature. The
assignment of values is now via hyperbolae of similar
shape, which grow (sub)linearly. They also have the
(aesthetic) advantage of being undefined within the image
extents themselves, the interior of which being one place
where a vanishing point ought not be. In a pinch, the
hyperbolae can also be approximated by their asymptotes,
which, being strictly linear, are easier to compute. For
example, the left hyperbola is sqrt((X-a)(X-b)/L): its
asymptotes originate at the left texel's midpoint, and have
slopes of 1/sqrt(L) and -1/sqrt(L) (see Figure 4). Still other
modifications and approximations of this formal equation
are possible; they would need to be analyzed for accuracy
and computational efficiency.


Figure  4:  The  hyperbola  and  asymptote  methods. 

3.3  Equal,  Colinear,  and  Adjacent 

The last special case is the simplest, but perhaps the
most powerful. Suppose that two colinear and adjacent
image extents are derived from two colinear, adjacent,
and equal scene components. That is, as in Figure 5, the
points B and C have merged. Then the constraint given
for the general four-point colinear case simplifies even
further, since B=C, to that of a linear constraint in p and
q:

(1-pa-qy)/L = (1-pd-qy)/R

By the same formal method as above, it can be
rewritten as:

(X-a)/L = (X-d)/R, where X = (1-qy)/p

Either side is the equation of a line. With exactly
the same flexibilities of the parabola scheme above, these
lines can be plotted in the image space (see Figure 5).
That is, they can extend out of the image in the -z
direction; they can be mutually scaled; X can again be
considered an image feature, labeling each position on the
line of colinearity with a two-tuple of features. As before,
the vanishing point occurs when the features are equal;
this occurs at X=(Ld-Ra)/(L-R).


Figure 5: "Jack-knife" method for vanishing points.

Yet another graphic construction is possible. It too
has a feature space interpretation, this time very useful.
Construct at A a feature of value L; conceptually, this is
constructed by a line of length L perpendicular to the line
of colinearity. (Alternatively, the line can point in the -z
direction.) Similarly construct at D a feature of value
R. The resulting figure may resemble a jack-knife, with
its two blades opened in parallel, outwards. (As with a
jack-knife, the blades do not need to be perpendicular to
their base; however, for the method to work, the blades
must be parallel. The proof is by similar triangles.) Then
under this interpretation, the feature values of all other
points on the line of colinearity are determined by linear
extrapolation from the two given ones. That is, values
are generated from this new X by (R(X-a)-L(X-d))/(L+R).
In particular, the vanishing point is where this image
feature value is 0, as can be verified by direct
substitution. It is not hard to show that this construction
really does implement an image feature: it is scaled
inverse depth.
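
The jack-knife construction reduces to a few arithmetic operations. The following sketch is illustrative (the function names are hypothetical, not from the paper): it computes the vanishing point from the closed form X=(Ld-Ra)/(L-R) and evaluates the linearly extrapolated feature (R(X-a)-L(X-d))/(L+R), which takes the value L at A, R at D, and 0 at the vanishing point.

```python
def jackknife_vanishing_x(a, b, d):
    """Vanishing point X for adjacent image extents [a, b] and [b, d].

    Uses X = (L*d - R*a) / (L - R) with L = b - a and R = d - b.
    Returns None when L == R (vanishing point at infinity).
    """
    L, R = b - a, d - b
    if abs(L - R) < 1e-12:
        return None
    return (L * d - R * a) / (L - R)

def jackknife_feature(X, a, b, d):
    """Linear extrapolation between the two blades: L at a, R at d."""
    L, R = b - a, d - b
    return (R * (X - a) - L * (X - d)) / (L + R)

def gradient_constraint(X, y):
    """Coefficients (X, y) of the linear gradient constraint p*X + q*y = 1."""
    return (X, y)
```

For example, three equally spaced scene points imaged at x = 0, 0.4 and 2/3 on one image line give L=0.4 and R=4/15, and the construction places the vanishing point at X=2.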

These methods are formal; as with the parabola
method, other modifications of the constraint equation are
possible as well. It should be noted that the jack-knife
equation can also be derived from the application of
methods of projective geometry: either through the
cross-ratio, or through the appropriate nine-point geometric
construction. The parabola method apparently cannot,
however, as it deals with five points at a time.

3.4  A  Reciprocal  Method 

The jack-knife method has an interesting extension.
The primary heuristic assumption required for its use only
requires that image extents arise from equal surface
extents; however, what is meant by extent can be defined
in many ways. In particular, a series of N extents laid
colinearly end to end on a surface can be considered
either as one extent of length N, or N of length one (or
many other combinations). Often, runs of multiple
extents can be obtained by looking for repeated
distinguishing events along an arbitrary line through the
image. (Strong edges of the same polarity, say, are
events: see Figure 6.) The prior jack-knife method would
try to normalize the extent of the entire run. But under
the assumption that the events form a texture, the
method can be extended to normalize each event as well.
Abstract

A computational framework for solving the visual
correspondence problem is presented and evaluated by using
a stochastic image model. The framework differs from
previous work in that it emphasizes the combination of a large
collection of independent measurements. Partial derivatives
of images smoothed with a few different-sized Gaussian
filters are suggested as suitable measurements. A specific
computation is shown, based on a stochastic image model,
to reliably establish whether or not two points correspond,
provided that the signal to correspondence noise ratio in
the images to be matched exceeds two. The computation
has been applied to artificial and natural images with
encouraging results.
1. Introduction

The problem of matching up two similar views of
the same scene is one of the critical problems which any
powerful vision system must solve. Known as the
correspondence problem, it occurs most notably in stereopsis,
where the two views come from separate vantage points,
and in motion analysis, where the two views come from
the same vantage point but are separated in time. In
stereopsis, a solution to the correspondence problem yields
relative depth information, while in motion analysis, it
yields information which can be used to segment an image
into regions belonging to different objects and can be
used to approximate their velocities. The human visual
system is known to solve both correspondence problems
with impressive range, resolution, and noise immunity,
though the manner in which it does so is ill understood.
Computer solutions to the correspondence problem have
fallen far short of similar performance, particularly on
natural images. A new approach to the problem will be
presented here which, it is hoped, will provide insight into
the structure of the correspondence problem, as well as a
robust technique for solving it.

The correspondence problem can be stated quite
simply as follows. Given two similar images of the same
scene, a point in the first image is said to correspond
with a point in the second image if both are projections
along lines of sight of the same physical point. The
correspondence problem consists of trying to match up as
many pairs of corresponding points as possible given the
intensity profiles of the two similar images.

All algorithmic solutions to the problem are based
on the idea that the light intensity profiles surrounding
corresponding points are quite similar. For each point
in the first image, only points in the second image with
quite similar local intensity profiles need be considered
as potential matches. If the similarity measure is chosen
appropriately, then a large fraction of the points in the
first image will have only one potential match in the
second. If it can be confidently determined that the
similarity between these points and their potential matches
is not due to chance, then the unique potential matches
can be trusted as correct matches. Global consistency
constraints can be used in some cases to choose among
several potential matches, but this may not be necessary
if the local information is extracted properly.

Choosing a good measure of similarity for the
correspondence problem is quite difficult. Corresponding
points often have substantially different light intensity
values because of the different viewing angles. More
importantly, at depth discontinuities in stereopsis or
object boundaries in motion analysis, the light intensity
values surrounding two corresponding points can be quite
different. When specular reflection and assorted sources
of noise are also considered, it becomes clear that the
similarity measure should be chosen quite carefully.

Two classes of similarity measures have been
investigated in the literature. The first consists of traditional
statistical measures such as correlation and mean square
error (e.g. [Gennery 77], [Moravec 77]). While algorithms
based on these measures have seen some success, their
performance has been rather disappointing as a whole.
Except under controlled conditions, the intensity profiles
of corresponding points are usually not correlated enough


It does this by simply dividing normalized run extent by
event count. Thus, given a run of events, the extended
method divides it into two sections, each with an image
extent and an event count, and solves the modified
equation

l(1-pa-qy)/L = r(1-pd-qy)/R, where l and r are event
counts.

Note that the run can be split in many places, and
that the modified equation can be solved by any of the
techniques given in the jack-knife method (with L and R
appropriately modified to L/l and R/r, respectively). The
optimal ways to split the run would have to be analyzed.
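
As a rough sketch of the substitution just described, the per-event extents L/l and R/r can be fed to the same zero-crossing computation used for the plain jack-knife construction. The function name and the use of scalar positions along the run are assumptions made for illustration only.

```python
import math

def reciprocal_vanishing_point(a, L, l, d, R, r):
    """Vanishing point from two sections of a run of events.

    a, d : positions of the two sections along the run (assumed scalars)
    L, R : image extents of the two sections
    l, r : event counts found in the two sections (assumed positive)
    """
    left = L / l    # extent per event in the first section
    right = R / r   # extent per event in the second section
    if left == right:
        # No foreshortening along this line: vanishing point at infinity.
        return math.inf
    # Zero of the linear extrapolation through (a, left) and (d, right).
    return (right * a - left * d) / (right - left)
```

With extents 4 and 3 holding 2 and 3 events respectively, the per-event extents are 2 and 1, and sections at positions 0 and 1 give a vanishing point at 2.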


Figure  6:  Jack-knife  methods  on  a  wave-like  texture. 

The jack-knife method is based on a measure of
extent-per-texel; this reciprocal method uses
texels-per-extent. The reciprocal method has many advantages.
The two sections that count events can be of fixed image
size and location. Within each section, event counts can
be recovered by simple pattern recognition techniques.
The final computation is simple. In effect, the shape
constraints under this method come from simple feature
detectors.

3.5  Examples  and  Comment 

The  true  beauty  of  the  jack-knife  methods  comes 
from  the  fact  that  they  are  one-step  and  robust. 

At least two other methods for determining surface
orientation rely on an implicit searching for image
"regularity"; having found it, they postulate the vanishing
line to be parallel to it. The remaining surface constraint
is determined by different means [Bajcsy 76; Stevens 79].
Here, the two steps are integrated; "tilt" need not be
found before "slant", since any two vanishing points will
do.

The  jack-knife  methods  succeed  even  with  difficult 
textures  or  orientations.  As  in  the  wave  texture  of  Figure 
6,  sometimes  the  vanishing  line  direction  has  no 
measurable  regularity;  regularity-based  tilt-searches  must 
fail.  The  jack-knife  methods  will  return  a  proper 
vanishing  point,  however,  as  long  as  they  are  not  aligned 
with  the  vanishing  line.  The  jack-knife  methods  even 
work  without  search  on  frontal  ((p,q)=(0,0))  textures,  in 
which  every  direction  exhibits  image  textural  regularity. 
In  this  case,  the  jack-knife  methods  properly  return 
infinite  vanishing  points. 
for these algorithms to work reliably on very many points
in an image. Difficulties with this class of similarity
measures have led a number of researchers to examine
a second class of similarity measure, those based on edge
finding (e.g. [Marr and Poggio 79], [Grimson 81], [Baker
and Binford 81]). These measures assert that two points
are similar if and only if they both lie on edges of
approximately the same orientation. More encouraging
results have been obtained with these methods, but
important problems remain. Perhaps chief among these is
the problem of occlusion. Physical points visible in only
one of the images tend to get matched spuriously by
edge-based algorithms. Chance matches for these points can
frequently be found and are difficult to prune. Since such
points occur principally at object boundaries, they are
arguably the most important points to deal with effectively.

This paper will describe a third kind of similarity
measure for the correspondence problem. Based on
the idea of combining independent measurements, the
measure has remarkable noise immunity and works
reliably at occluding contours. No single image measurement
in this approach is trusted to indicate very much about the
correspondence of a pair of images. Unanimity among the
independent measurements, however, is taken as a
powerful indication of correspondence. Because information
from a large number of measurements is combined, the
approach is far more robust, in a number of important
ways, than approaches which rely heavily on a very small
number of measurements. As a consequence, the solution
to be presented here can be expected to work quite well
in a wide variety of viewing conditions.

2.  Similarity  Measurement 

Let $I_1(x, y)$ and $I_2(x, y)$ be the light intensity functions
for two images whose correspondence is to be computed,
and let $D(x, y)$ be the true offset or disparity between
the images, measured relative to the coordinate system of
$I_1(x, y)$ and defined on some set of points $P \subset \Re^2$ for
which corresponding physical points are visible in both
images. Then for all $p \in P$, $I_1(p)$ and $I_2(p + D(p))$ are
projections of the same physical point. The problem is to
recover $D$ from $I_1$ and $I_2$.

Measuring similarity can be thought of as a two step
process. The first step is to create a representation of
the local intensity variation at every point in each of the
two images. The second step is to compare the local
representations and determine how close they are to each
other. In the general case, the representation consists of a
collection $f_i(p, I)$, $1 \le i \le n$, of different image functionals
(filters). For edge-based approaches, the functionals would
measure the presence or absence of different classes of
edges, and for correlation approaches they would measure
weighted local image intensities.

In order to make use of the full power of statistical
combination, we need the functionals to be both numerous
and nearly independent. Typical sets of edge-based
functionals are insufficient in number, and correlation-based
functionals are not independent, so neither set is
appropriate for reliable statistical inference. The typical
edge-based functionals could be supplemented by others,
but that will not be investigated here. Instead, the simplest
interesting class of functionals, linear ones, will be
considered. One reasonable set of nearly independent
linear functionals will be presented in section 5. For the
moment, assume such a set exists.

Each functional in the local intensity representation
implicitly defines a similarity measure for correspondence,
since we expect that $f_i(p, I_1) \approx f_i(p + D(p), I_2)$ provided
the $f_i$ are chosen carefully. If we combine the functionals
into a vector at each point,
$F(p, I) = (f_1(p, I), f_2(p, I), \ldots, f_n(p, I))$,
then we can expect the vector $F(p_1, I_1) - F(p_2, I_2)$
to be very small in each component if $p_1$ and
$p_2$ correspond. On the other hand, if $p_1$ and $p_2$ do not
correspond, it is likely that $F(p_1, I_1) - F(p_2, I_2)$ has at least
one large component.

The above intuition can be translated into an
algorithm as follows. Define $matchp_i(p_1, p_2)$ to be a predicate
which is true if and only if

$$|f_i(p_1, I_1) - f_i(p_2, I_2)| < k_i \, \sigma(f_i(p, I_1))$$

where $\sigma(x)$ denotes the square root of the expected value
of $x^2$, and let $matchp(p_1, p_2)$ be a predicate which is true
if and only if $matchp_i(p_1, p_2)$ holds for all $i \in \{1, 2, \ldots, n\}$.
Then matchp is true of a pair of points $p_1$ and $p_2$ if
each component of $F(p_1, I_1) - F(p_2, I_2)$ is smaller than
its globally determined threshold. It will be argued that
matchp does a good job of solving the correspondence
problem if the $f_i$ and the $k_i$ are chosen appropriately: it
is almost always true of corresponding points and almost
never true of non-corresponding points.
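
A minimal sketch of the predicate follows, assuming the functional values at the two points have already been computed and that each global scale $\sigma(f_i(p, I_1))$ is supplied by the caller. The argument names are illustrative and not taken from the paper.

```python
def matchp_i(f1, f2, k, sigma):
    """Single-functional test: |f_i(p1, I1) - f_i(p2, I2)| < k_i * sigma_i."""
    return abs(f1 - f2) < k * sigma

def matchp(F1, F2, ks, sigmas):
    """True only if every functional agrees to within its own threshold.

    F1, F2 : sequences holding F(p1, I1) and F(p2, I2)
    ks     : thresholds k_i
    sigmas : root-mean-square values of each functional over the image
    """
    return all(matchp_i(a, b, k, s) for a, b, k, s in zip(F1, F2, ks, sigmas))
```

A single component that exceeds its threshold is enough to reject the pair, which is what keeps chance matches rare when the functionals are numerous and nearly independent.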

3.  Expected  Error  Rates 

In order to evaluate matchp, suppose the $f_i$ are
orthogonal linear shift invariant functionals and consider
the following stationary image model. Let $I_1$ be stationary
Gaussian white noise and let $I_2$ be derived from $I_1$
by shifting it according to $D(p)$ and adding Gaussian
white correspondence noise, $N(p)$. The efficacy with
which $D(x, y)$ can be determined from the $f_i$ under these
conditions depends upon how well the $f_i$ are preserved
between views. Let

$$SNR_i = \sigma(f_i(p, I_1)) / \sigma(f_i(p, N))$$

be the signal-to-correspondence-noise ratio of the $i$th
functional. If $SNR_i$ is greater than two for a dozen
functionals, then matchp will very reliably determine whether or
not two points correspond.
Three performance criteria will be considered for
matchp. The first is the rate of false positives: the
probability that matchp will be true of two non-corresponding
points. The second criterion is the rate of false negatives:
the probability that matchp will be false of two
corresponding points. The third criterion is one of resolution
and concerns the extent to which corresponding points
can be spatially localized.

The calculation of the false negative rate is relatively
straightforward. Let $p_1$ be a randomly selected point in
$D$. The difference between the $i$th functional evaluated at
$p_1$ and the same functional evaluated at the corresponding
point $C(p_1)$ is equal to the value of the functional applied
to the correspondence noise. A false negative occurs when
that difference exceeds the threshold. The distribution of
$f_i(p, N)$ is normal since it is a convolution with a Gaussian
process. The probability that it exceeds the threshold is
the false negative rate for $matchp_i$ and is given by

$$\Pr[\neg\, matchp_i(p_1, C(p_1))] = 1 - \mathrm{erf}\left(\frac{k_i \, SNR_i}{\sqrt{2}}\right).$$

A false negative occurs for matchp when a false
negative occurs for any of the $matchp_i$ predicates. Since
the functionals are independent, the false negative rate for
matchp is

$$\Pr[\neg\, matchp(p_1, C(p_1))] = 1 - \prod_i \mathrm{erf}\left(\frac{k_i \, SNR_i}{\sqrt{2}}\right).$$

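The false positive rate is also easy to calculate. Let
$p_1$ and $p_2$ be two randomly selected non-corresponding
points. The difference between $f_i(p_1, I_1)$ and $f_i(p_2, I_2)$
is a normally distributed random variable with standard
deviation

$$\sigma(f_i(p_1, I_1) - f_i(p_2, I_2)) = \sqrt{\sigma^2(f_i(p, I_1)) + \sigma^2(f_i(p, I_2))} \approx \sqrt{2}\, \sigma(f_i(p, I_1))$$

where the approximation is based on the assumption
that $\sigma(f_i(p, I_1)) \approx \sigma(f_i(p, I_2))$. Thus the probability of
a false positive based on the $i$th functional is just the
probability that a normal random variable with the above
standard deviation has magnitude below the threshold.
That probability is given by

$$\Pr[matchp_i(p_1, p_2)] = \mathrm{erf}\left(\frac{k_i}{2}\right).$$

A false positive for matchp occurs only when a false
positive occurs for each of the $matchp_i$ predicates. Hence
the probability of a false positive is just

$$\Pr[matchp(p_1, p_2)] = \prod_i \mathrm{erf}\left(\frac{k_i}{2}\right)$$

due to the independence of the functionals.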

Suppose $SNR_i = 2$ for all $i$ and $n = 12$. Then the
choice of $k_i$ represents a tradeoff between a very low false
positive rate and a very low false negative rate. False
positives often result in the generation of wrong disparity
values, so they tend to be quite serious. False negatives,
on the other hand, usually result simply in not being able
to determine the disparity at a particular point. Thus a
reasonable choice of $k_i$ is one which produces a negligible
false positive rate while still keeping the false negative rate
to a low level. One such choice is $k_i = 1.2$. The resulting
false positive rate is .2 per cent and the resulting false
negative rate is 18 per cent. Both rates are a good deal
lower than what is needed to reliably determine image
correspondence. If the signal to correspondence noise
ratio is improved to three, the false positive rate can be
improved an order of magnitude without worsening the
false negative rate.
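
The quoted rates can be checked numerically. The sketch below assumes the single-functional expressions $\mathrm{erf}(k_i SNR_i/\sqrt{2})$ (a corresponding pair passes) and $\mathrm{erf}(k_i/2)$ (a non-corresponding pair passes), identical $k_i$ and $SNR_i$ across functionals, and full independence; the function names are ours.

```python
import math

def false_negative_rate(k, snr, n):
    """Probability that at least one of n independent functionals
    rejects a truly corresponding pair."""
    pass_one = math.erf(k * snr / math.sqrt(2.0))
    return 1.0 - pass_one ** n

def false_positive_rate(k, n):
    """Probability that all n independent functionals accept a
    non-corresponding pair."""
    return math.erf(k / 2.0) ** n

# Values discussed in the text: SNR = 2, n = 12, k = 1.2.
fn = false_negative_rate(1.2, 2.0, 12)   # close to 0.18
fp = false_positive_rate(1.2, 12)        # close to 0.002
```

Under these assumptions the computed values agree with the 18 per cent and .2 per cent figures given above.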

4.  Expected  Resolution 

The third criterion of performance for matchp is that
of resolution. If $p_1$ is picked at random and $p_2 = C(p_1) + r$,
then if $r$ is small enough, $\Pr[matchp(p_1, p_2)]$ will be
quite large. The separation $r$ at which $\Pr[matchp(p_1, p_2)]$
becomes small will determine the resolution with which
disparity can be recovered using matchp. Let $A_i$ be the
autocorrelation function for $f_i(p, I_2)$ on $I_2$, defined as

$$A_i(x, y) = \frac{f_i(p, I_2) * f_i(p, I_2)}{\sigma^2(f_i(p, I_2))}$$

where the asterisk denotes convolution. Then $f_i(C(p_1), I_2)$
and $f_i(C(p_1) + r, I_2)$ have a joint normal distribution with
correlation $A_i(r)$. The density of the distribution is

$$\frac{1}{2\pi |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} X \Sigma^{-1} X^T\right)$$

where $X$ is the vector $(f_i(C(p_1), I_2), f_i(C(p_1) + r, I_2))$, $X^T$
is $X$ transpose and $\Sigma$ is the covariance matrix:

$$\Sigma = \sigma^2(f_i(p, I_2)) \begin{pmatrix} 1 & A_i(r) \\ A_i(r) & 1 \end{pmatrix}$$

Consider first the case where the correspondence noise
is zero. Then we are interested in the probability

$$v_i(r) = \Pr[\,|f_i(C(p_1), I_2) - f_i(C(p_1) + r, I_2)| < k_i \, \sigma(f_i(p, I_2))\,]$$

that one of the functionals evaluated at two points separated
by $r$ does not change enough to produce differing values
of $matchp_i$ at those points. Integrating the joint normal
density over the area where
$|f_i(C(p_1), I_2) - f_i(C(p_1) + r, I_2)| < k_i \, \sigma(f_i(p, I_2))$ yields

$$v_i(r) = \mathrm{erf}\left(\frac{k_i}{2\sqrt{1 - A_i(r)}}\right).$$

For a functional whose impulse response has finite energy,
$A_i(r)$ must asymptotically approach zero as $r$ becomes
large. As a consequence,

$$\lim_{|r| \to \infty} v_i(r) = \mathrm{erf}\left(\frac{k_i}{2}\right).$$

This should come as no surprise, since the left side is the
probability that two points separated by $r$ differ in the $i$th
functional by less than the threshold, which should be
equal to the false positive rate for $matchp_i$ in the limit as
$|r| \to \infty$.
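
The noise-free expression can be evaluated directly once the normalized autocorrelation at the separation of interest is known. The following sketch assumes the form $\mathrm{erf}(k_i / (2\sqrt{1 - A_i(r)}))$ and takes the autocorrelation value as an input rather than computing it from a filter; the function name is illustrative.

```python
import math

def same_response_probability(k, autocorr):
    """Probability that a functional changes by less than its threshold
    between two points whose responses have correlation `autocorr`.

    autocorr is the normalized autocorrelation A_i(r), in [-1, 1].
    """
    if autocorr >= 1.0:
        # Zero separation: the two responses are identical.
        return 1.0
    return math.erf(k / (2.0 * math.sqrt(1.0 - autocorr)))
```

As the autocorrelation falls from one toward zero, the probability falls from one toward erf(k/2), the single-functional false positive rate.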

Now consider the impact of correspondence noise on
the resolution. The probability of $matchp_i(p_1, C(p_1) + r)$
being true is equal to the probability that $f_i(C(p_1) + r, I_2)$
falls in the interval $B = [f_i(p_1, I_1) - m_i, f_i(p_1, I_1) + m_i]$,
where $m_i = k_i \, \sigma(f_i(p_1, I_1))$ is the threshold for $matchp_i$.
Suppose $matchp_i(p_1, C(p_1))$ is true. Then $f_i(C(p_1), I_2)$
is by definition contained in the interval $B$. Since $B$
has length $2m_i$, it must be contained in the interval
$C = [f_i(C(p_1), I_2) - 2m_i, f_i(C(p_1), I_2) + 2m_i]$. Thus the
probability that $matchp_i(p_1, C(p_1) + r)$ is true is less than
the probability that $f_i(C(p_1) + r, I_2)$ falls in the interval
$C$, a probability that can be calculated as before to be

$$\Pr[f_i(C(p_1) + r, I_2) \in C] = \mathrm{erf}\left(\frac{k_i}{\sqrt{1 - A_i(r)}}\right).$$

Hence, in the presence of correspondence noise, the
resolution of $matchp_i$ with $k_i = k$ for points which it
correctly matches can be no worse than the resolution of
$matchp_i$ in the absence of noise with $k_i = 2k$. Let $v_i^*(r)$
be the probability that $matchp_i(p_1, C(p_1) + r)$ is true given
that $matchp_i(p_1, C(p_1))$ is true. Then

$$v_i^*(r) \le \mathrm{erf}\left(\frac{k_i}{\sqrt{1 - A_i(r)}}\right).$$

If any of the $matchp_i$ can resolve the disparity to
within $r$, then matchp will also be able to do so. A
conservative estimate of its resolution is expressed by the
relation

$$\Pr[matchp(p_1, C(p_1) + r) \mid matchp(p_1, C(p_1))] = v^*(r) \le \prod_i \mathrm{erf}\left(\frac{k_i}{\sqrt{1 - A_i(r)}}\right).$$

Functionals whose autocorrelation functions fall off slowly
with distance from the origin will not affect the resolution
very much, since their contribution to the above probability
will be multiplication by a factor near one in the area of
interest. On the other hand, a functional with a sharply
peaked autocorrelation function will strongly affect the
resolution.


One useful measure of resolution is the distance $r_s$
at which the probability of discrimination drops to fifty
per cent. A very conservative estimate of $r_s$ can be
produced by looking only at the functional with the most
strongly peaked autocorrelation function and using the
conservative estimate developed above for the resolution
of a single functional in the presence of correspondence
noise. Suppose

$$.5 = \mathrm{erf}\left(\frac{k_i}{\sqrt{1 - A_i(r_s)}}\right).$$

Then

$$\frac{k_i}{\sqrt{1 - A_i(r_s)}} = \mathrm{erf}^{-1}(.5) \approx .48.$$

Using a second order Taylor expansion for $A_i(r)$, we
obtain

$$1 - A_i(r_s) \approx -\tfrac{1}{2} A_i''(0)\, r_s^2, \qquad r_s \approx \frac{k_i}{\mathrm{erf}^{-1}(.5)} \sqrt{\frac{2}{-A_i''(0)}}.$$

Thus the separation at which fifty per cent discrimination
occurs using $matchp_i$ is approximately proportional to the
threshold $k_i$ and inversely proportional to the square root
of the curvature of the autocorrelation of the $i$th functional
at zero. The resolution of matchp can be expected to be a
good deal better than the best resolution of the $matchp_i$.
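
To make the proportionality concrete, the sketch below assumes the conservative single-functional bound $\mathrm{erf}(k_i/\sqrt{1 - A_i(r)})$ and the second order expansion $A_i(r) \approx 1 - \tfrac{1}{2} c\, r^2$ with $c = -A_i''(0)$. The bisection helper and all names are ours, and the estimate is only meaningful while the implied $1 - A_i(r_s) = (k_i/\mathrm{erf}^{-1}(.5))^2$ stays small enough for the expansion to hold.

```python
import math

def inverse_erf(y, lo=0.0, hi=6.0):
    """Invert math.erf for 0 <= y < 1 by bisection (erf is increasing)."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if math.erf(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def half_discrimination_radius(k, curvature):
    """Separation r_s at which erf(k / sqrt(1 - A(r_s))) equals 0.5,
    using A(r) ~ 1 - curvature * r**2 / 2 with curvature = -A''(0) > 0."""
    return (k / inverse_erf(0.5)) * math.sqrt(2.0 / curvature)
```

Doubling the threshold doubles the estimated separation, while quadrupling the curvature halves it.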

5.  Choice  of  Measurements 

In deriving the properties of matchp which allow it to
be used to solve the correspondence problem, the existence
of a set of independent, linear shift invariant
functionals whose values are loosely preserved between
views was assumed. One such set will be presented here.
If $p_1$ is a point in $D$ and $p_2 = C(p_1)$ is its corresponding
point, then the functions $I_1(p_1 + r)$ and $I_2(C(p_1) + r)$ can be
expected to be quite similar for small values of $r$. One complete
characterization of the local behavior of a function of
two dimensions is its two dimensional Taylor series, so
it is natural to examine derivatives of $I_1(p_1 + r)$ and
$I_2(C(p_1) + r)$. As one might expect, first and second partial
derivatives appear empirically to be fairly well preserved
between views. Differentiation tends to accentuate noise,
however, so it is usually a good idea to do some low pass
filtering before taking any sort of derivative. Marr and
Hildreth [1980] argue that the best low pass filter to use
for applications such as this is a filter with a Gaussian
impulse response because it minimizes the product of
localization in space and frequency. Hence a reasonable
set of functionals to look at is the set of derivatives of
Gaussian smoothed images.

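Not all derivatives of Gaussian smoothed images are
independent. In fact, the $n$th and $(n+2)$nd derivatives of
Gaussian smoothed white noise are very strongly
correlated. These correlations can be calculated fairly directly.

The Gaussian mask normalized to have unit integral
is

$$f_\sigma(x, y) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}.$$

For notational convenience, define

$$f_{n,m,\sigma} = \frac{\partial^{\,n+m} f_\sigma}{\partial x^n \, \partial y^m}.$$

The desired correlation is

$$\mathrm{Corr}(f_{n_1,m_1,\sigma_1}, f_{n_2,m_2,\sigma_2}) = \frac{\iint f_{n_1,m_1,\sigma_1} f_{n_2,m_2,\sigma_2} \, dx \, dy}{\sqrt{\left(\iint f^2_{n_1,m_1,\sigma_1} \, dx \, dy\right)\left(\iint f^2_{n_2,m_2,\sigma_2} \, dx \, dy\right)}}$$

where the integration goes from negative infinity to
positive infinity.

Straightforward calculations [Kass 82] show that the
magnitude of the correlation is just

$$|\mathrm{Corr}(f_{n_1,m_1,\sigma_1}, f_{n_2,m_2,\sigma_2})| = \left(\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}\right)^{m+n+2} \frac{n! \, m!}{(n/2)! \, (m/2)!} \sqrt{\frac{n_1! \, m_1! \, n_2! \, m_2!}{(2n_1)! \, (2m_1)! \, (2n_2)! \, (2m_2)!}}$$

with $n = n_1 + n_2$ and $m = m_1 + m_2$.

The following table gives the correlations for the case
where $m_1 = m_2 = 0$ and $\sigma_1 = \sigma_2$. Note the high
correlation between $f_{n,0,\sigma}$ and $f_{n+2,0,\sigma}$.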


            f      ∂f/∂x   ∂²f/∂x²  ∂³f/∂x³  ∂⁴f/∂x⁴
f           1.     0.      -.58     0.       .29
∂f/∂x       0.     1.      0.       -.77     0.
∂²f/∂x²     -.58   0.      1.       0.       -.85
∂³f/∂x³     0.     -.77    0.       1.       0.
∂⁴f/∂x⁴     .29    0.      -.85     0.       1.


The high correlations in the above table suggest that most
of the usable information in the local behavior of an image
at a point $p$ is contained in a maximal set of independent
terms of the Taylor series around $p$. There is a high
correlation between $f_{n_1,m_1,\sigma}$ and $f_{n_2,m_2,\sigma}$ when
$n_1 \equiv n_2 \pmod 2$ and $m_1 \equiv m_2 \pmod 2$. Hence no independent
set of Taylor series terms for images can have more than
four elements, one for each possible combination of $n$
(mod 2) and $m$ (mod 2), where $n$ and $m$ are the number of
derivatives taken in the $x$ and $y$ directions. One maximal
set of independent functionals is given by the following
set of derivatives of Gaussian smoothed white noise.

$$J_\sigma = \left\{ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial^2 f}{\partial x^2}, \frac{\partial^2 f}{\partial y^2} \right\}$$
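
The tabulated same-size correlations can be reproduced from the even moments of a Gaussian power spectrum. The sketch below is restricted to x-derivatives of one Gaussian size ($m_1 = m_2 = 0$, $\sigma_1 = \sigma_2$) and uses the closed form $(n_1 + n_2 - 1)!! / \sqrt{(2n_1 - 1)!!\,(2n_2 - 1)!!}$ with sign $(-1)^{(n_1 - n_2)/2}$, which is our own restatement rather than the paper's formula; the function names are illustrative.

```python
import math

def odd_double_factorial(k):
    """(2k - 1)!! = 1 * 3 * ... * (2k - 1), with the empty product equal to 1."""
    result = 1
    for j in range(1, 2 * k, 2):
        result *= j
    return result

def derivative_correlation(n1, n2):
    """Correlation between the n1-th and n2-th x-derivatives of
    equally smoothed Gaussian white noise."""
    if (n1 + n2) % 2:
        # Derivatives of opposite parity are uncorrelated.
        return 0.0
    half = (n1 + n2) // 2
    sign = -1.0 if (abs(n1 - n2) // 2) % 2 else 1.0
    denom = math.sqrt(odd_double_factorial(n1) * odd_double_factorial(n2))
    return sign * odd_double_factorial(half) / denom
```

Evaluating derivative_correlation(0, 2), (1, 3), (2, 4) and (0, 4) gives values that round to -.58, -.77, -.85 and .29, matching the table.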

A larger set of approximately independent functionals
can be constructed by considering different amounts of
Gaussian smoothing. If the ratio between the standard
deviations of the two Gaussians is $s = \sigma_1/\sigma_2$, then the effect
of the size of the Gaussians on their correlation can be
expressed by the relation

$$\mathrm{Corr}(f_{n_1,m_1,\sigma_1}, f_{n_2,m_2,\sigma_2}) = \left(\frac{2s}{s^2 + 1}\right)^{m+n+2} \mathrm{Corr}(f_{n_1,m_1,\sigma}, f_{n_2,m_2,\sigma}).$$

The maximum correlation between two functionals in
$J_{\sigma_1} \cup J_{\sigma_2}$ is therefore $(2s/(s^2 + 1))^4$, which occurs between
first order terms. If $s = 2$, the correlation is .41, but if $s =
2.5$, it drops to .23, and if $s = 3$, it falls to .13. The impact
of these small, non-zero correlations on the performance of
matchp can safely be ignored. If the number of different
Gaussians is increased to three, the largest correlation does
not increase. Thus $J = J_\sigma \cup J_{s\sigma} \cup J_{s^2\sigma}$ defines a set of
twelve functionals in which the largest pairwise correlation
is still $(2s/(s^2 + 1))^4$. If $s$ is at least 2.5, then the twelve
functionals in $J$ will have sufficiently low correlations to
be regarded as approximately independent. Since they
are all linear and shift-invariant, they will satisfy all the
conditions on the $f_i$ used in deriving the performance of
matchp.

A conservative estimate of the probability that a
particular functional will be unable to resolve the disparity of
a point to better than an uncertainty of $r$ was previously
calculated in terms of the most sharply peaked
autocorrelation. The autocorrelation function of $f_{n,m,\sigma}$ is just

$$f_{n,m,\sigma} * f_{n,m,\sigma} = f_{2n,2m,\sqrt{2}\sigma}$$

so the probability that the functional with impulse response
$f_{n,m,\sigma}$ will be unable to localize the disparity of a pair of
images to within a range smaller than $r$ is no larger than

$$\mathrm{erf}\left(\frac{k_i}{\sqrt{1 - A_i(r)}}\right), \qquad A_i(r) = \frac{f_{2n,2m,\sqrt{2}\sigma}(r)}{f_{2n,2m,\sqrt{2}\sigma}(0)}.$$

The best resolution along the x-axis for functionals in $J$
occurs with the functional that has an impulse response
equal to $f_{2,0,\sigma}$. Its autocorrelation function is

$f_{2,0,\sigma} * f_{2,0,\sigma} = f_{4,0,\sqrt{2}\sigma}$

a4  -  48gV  +  48g23 
64tt  ) 

The probability that none of the functionals will be able
to resolve the disparity to within $r$ is the product of the
$v_i$ and measures the resolution of matchp.

6.  Empirical  Performance 

As  an  initial  test,  matchp  was  applied  to  a  pair  of 
gray  level  images  generated  by  computer  with  all  the 
important  characteristics  of  the  image  model  used  here. 

Figure 1. Matchp applied to a Julesz random dot stereogram.
Dark points were unmatched.


Figure one shows the results of applying matchp to
the  above  pair  of  images.  The  dark  points  indicate  areas 
where  the  algorithm  was  unable  to  find  matches.  Slightly 
over  93  percent  of  the  pixels  in  each  image  were  uniquely 
matched.  The  mean  square  error  in  the  disparity  values 
generated  was  a  small  fraction  of  a  pixel  despite  the  large 
amount  of  correspondence  noise. 

It  is  worth  noting  that  matchp  failed  to  match  most  of 
the  points  along  the  border  of  the  shifted  central  square. 
Some  of  the  points  were  occluded  in  the  second  image,  so 
it  was  correct  not  to  match  them,  but  most  of  the  points 
went  unmatched  because  the  steep  disparity  gradient  on 
the  border  substantially  decreased  the  signal-to-noise  ratio. 
Correlation  and  edge-based  algorithms  tend  to  generate 
significant  numbers  of  incorrect  disparity  values  at  oc¬ 
cluded  regions  and  at  places  where  the  disparity  gradient 
is large, but the algorithm based on matchp avoids doing
so because of matchp's unusually low false positive rate.

Matchp has been applied to a small number of
natural images as well. Figure 3 shows the results of
interpolating a surface through disparity values generated
by matchp for the stereo pair in Figure 2. Intensity is
proportional to depth. The photos are of the campus
of the University of British Columbia and were obtained
from the B.C. Ministry of Forests.

The  combination  of  independent  results  has  long  been 
a  favorite  method  of  statisticians.  Matchp  represents 
an  attempt  to  bring  the  power  of  this  method  to  bear 
on  the  visual  correspondence  problem.  Despite  using 
the  simplest  method  of  combination  imaginable,  matchp 
attains  a  rather  high  level  of  performance  and  so  argues 
strongly  for  the  applicability  of  this  statistical  tool  to  the 
correspondence  problem. 
7 * was used as the set of functionals. The image pair has a
signal-to-correspondence-noise ratio of two and a disparity
field which is zero everywhere except in a central square
covering one ninth of the image area, where it is (6,0) in
pixels. For each point (x,y) in the left image, matchp(x +
n, y), n ∈ {-8, -7, -6, ..., 6, 7, 8}, was calculated. If
there was only one n such that matchp(x + n, y) was
true, n was recorded as the disparity value. If there was
more than one n such that matchp(x + n, y) was true,
the disparity was recorded as ambiguous. If there was no
n such that matchp(x + n, y) was true, the disparity was
recorded as unknown.
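
The recording rule just described can be sketched as follows. This is only an illustration: the predicate signature matchp(x, y, n), standing for "the left-image point (x,y) matches the right-image point (x + n, y)", and the toy predicate used to exercise it are assumptions, not the implementation used in the experiment.

```python
from typing import Callable, Union


def classify_disparity(matchp: Callable[[int, int, int], bool],
                       x: int, y: int,
                       max_shift: int = 8) -> Union[int, str]:
    """Return the unique shift n, or 'ambiguous' / 'unknown'."""
    # Test every candidate shift n in {-max_shift, ..., max_shift}.
    hits = [n for n in range(-max_shift, max_shift + 1) if matchp(x, y, n)]
    if len(hits) == 1:
        return hits[0]          # exactly one match: record the disparity
    return "ambiguous" if hits else "unknown"


def toy_matchp(x: int, y: int, n: int) -> bool:
    """Hypothetical predicate: columns 10..19 match only at shift 6."""
    return n == 6 and 10 <= x < 20
```

Because a point is assigned a disparity only when exactly one shift passes the test, a predicate with a low false positive rate leaves difficult points unmatched rather than mismatched.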


Figure 2. University of British Columbia from the air.


Figure 3. Matchp output. Intensity is proportional to depth.
ABSTRACT

This  paper  describes  two  motion  estimation 
algorithms. The first makes use of the scattergram
of motion vectors to guide local smoothing,
while  the  second  is  based  on  a  multiresolution 
("pyramid")  image  representation. 


1.  INTRODUCTION 

In  this  paper  we  describe  two  algorithms  for 
motion  estimation.  The  first  is  an  adaptation 
of  the  grey  level  enhancement  algorithm  called 
"superspike"  to  the  problem  of  motion  estimation, 
while the second is a multiresolution (i.e.,
pyramid) motion estimation algorithm. Section 2
describes  the  motion  superspike  algorithm  (more 
details  can  be  found  in  Xie  et  al.  [1]),  and 
Section  3  presents  the  multiresolution  algorithm 
(more  details  can  be  found  in  Wohn  et  al.  [2]). 

2.  MOTION  SUPERSPIKE 

This  section  describes  an  adaptation  of  an 
image  enhancement  algorithm  called  the  "superspike" 
algorithm to motion field enhancement. In contrast
to most motion estimation and enhancement algorithms,
which rely solely on local information in the
image, the superspike algorithm utilizes global
information about the motion field, derived from
a histogram of the x and y components of the estimated
motion. The incorporation of global information
leads to estimates of motion that are both more
accurate and more precise.

Superspike was introduced in [3] as an enhancement
algorithm for grey scale images; a generalization
to color was presented in [4-6].
Superspike is an iterative algorithm which at each
iteration replaces the grey level (or, more generally,
spectral vector) of a pixel, P, by the
average grey level of a subset of the pixels in
some fixed size neighborhood of P. A neighbor Q
is included in this averaging subset if:

1) the grey level gQ at Q is in the same histogram
cluster as the grey level gP at P
(this is ordinarily determined by checking
that the histogram values between
gP and gQ are monotonic; some smoothing
of the histogram is necessary to avoid
being misled by local peaks and valleys);
and

2) the value of the histogram at gQ is higher
than that at gP, i.e., gQ is a more
probable grey level in the image than gP.

Of course, each iteration of the algorithm
changes the global grey level distribution, although
after only a few iterations there are ordinarily
not many changes in the neighborhood subsets
of pixels that are used to determine the new
grey levels. Empirically, the result of applying
the superspike algorithm is that the grey level
histogram is reduced to a small number of spikes;
this, of course, makes it trivial to segment the
image into homogeneous regions. For a more detailed
description of the algorithm, see [3-6].
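
One iteration of the grey level algorithm might be sketched as below. This is a minimal illustration rather than the implementation of [3-6]. The smoothing width, the 3x3 neighborhood, integer grey levels, and the inclusion of P itself in the average (so that the average is always defined) are all assumptions made for the sketch.

```python
import numpy as np


def smooth_histogram(hist, width=5):
    """Unweighted moving average (the width is an assumed parameter)."""
    return np.convolve(hist, np.ones(width) / width, mode="same")


def qualifies(hist, g_p, g_q):
    """True if g_q is more probable than g_p and in the same cluster."""
    if hist[g_q] <= hist[g_p]:
        return False                      # criterion 2 fails
    step = 1 if g_q > g_p else -1
    path = hist[np.arange(g_p, g_q + step, step)]
    # Criterion 1: histogram values rise monotonically from g_p to g_q.
    return bool(np.all(np.diff(path) >= 0))


def superspike_iteration(image, levels=256, radius=1, width=5):
    """Replace each pixel by the mean of itself and qualifying neighbors."""
    hist = np.bincount(image.ravel(), minlength=levels).astype(float)
    hist = smooth_histogram(hist, width)
    out = image.copy()
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            g_p = int(image[r, c])
            chosen = [g_p]                # assumption: P is included
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    g_q = int(image[rr, cc])
                    if qualifies(hist, g_p, g_q):
                        chosen.append(g_q)
            out[r, c] = int(round(float(np.mean(chosen))))
    return out
```

Since the histogram is recomputed from the output of each pass, repeated calls move grey levels toward the nearest histogram peak, which is the spiking behavior described above.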

It is possible to modify the superspike algorithm
so that it can be applied to motion field
enhancement. We will describe the modifications
necessary for applying it to enhancing motion
fields where the motion is constrained to be translation
in the image plane. It is also possible, in
principle, to deal with image plane rotations and
zooms; however, we were not successful in obtaining
useful segmentations for more general motions even
when analyzing carefully controlled motion sequences.

We  assume  that  we  are  given  a  motion  field,  M, 
which  specifies  the  x  and  y  components  (u  and  v) 
at  each  pixel.  Since  the  superspike  algorithm 
requires  a  relatively  dense  motion  field,  the 
original motion vectors are computed using a differential
technique such as described in [7-9].

The u and v components of motion are used to
construct a two-dimensional histogram (or scatterplot)
of M, and this histogram is smoothed over
10x10 neighborhoods using simple unweighted averaging
to eliminate spurious peaks and valleys.

Given a point, P, in M, we choose a subset of the
points in a 3x3 neighborhood of P to compute the
new u and v motion components at P. This subset
is chosen using the same algorithm employed by the
multispectral version of superspike. Finally, the
u and v components of P are averaged with those of
the pixels in the selected subset. Once these new
components are computed, the two-dimensional histogram
is recomputed and smoothed, and the process
can be iterated.
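
The histogram construction and smoothing step can be sketched as follows. The number of bins, the range of motion values, and the zero padding at the histogram border are assumptions made for illustration; only the 10x10 unweighted averaging comes from the description above.

```python
import numpy as np


def motion_histogram(u, v, bins=64, value_range=(-16.0, 16.0)):
    """Two-dimensional histogram (scatterplot) of the (u, v) components."""
    hist, u_edges, v_edges = np.histogram2d(
        u.ravel(), v.ravel(), bins=bins, range=[value_range, value_range])
    return hist, u_edges, v_edges


def box_smooth(hist, size=10):
    """Unweighted size x size moving average, zero padded at the border."""
    lo = size // 2
    hi = size - 1 - lo
    padded = np.pad(hist, ((lo, hi), (lo, hi)))
    # A summed-area table gives every window sum in constant time.
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    total = (integral[size:, size:] - integral[:-size, size:]
             - integral[size:, :-size] + integral[:-size, :-size])
    return total / (size * size)
```

The smoothed histogram would then play the role that the smoothed grey level histogram plays in the original algorithm, with cluster membership tested in the (u, v) plane.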

Figures 1-2 contain an example. Figure 1
shows frames 1 and 5 of a natural motion sequence.
Figure 2 shows the histograms of the x- and y-
components of motion both originally and after 5
iterations of motion superspike. The two (trivially
segmentable) peaks in the latter histograms
correspond closely to the actual (hand-calculated)
motion of the cars. Further examples can be found
in [1].

3.  MULTIRESOLUTION  MOTION  ESTIMATION 

In this section we present a very brief description
of a pyramid-based motion estimation algorithm,
and present one example of its application
to a motion sequence. More details on the
algorithm can be found in [2].

Given a time varying image sequence, we construct
a grey level pyramid for each frame in the
sequence  using  median  sampling.  The  grey  level 
pyramids  are  overlapped  pyramids,  so  that  each 
pixel  at  level  i  has  four  fathers  at  level  i+1. 

For  the  middle  frame  and  for  each  level  of  the 
pyramid,  we  compute  an  initial  estimate  of  the 
motion  using  a  gradient-based  algorithm.  The 
motion estimate at level i is computed by considering
the set of images at level i of the
pyramid as a (reduced resolution) motion sequence,
assuming that the image motion is locally a two-dimensional
translation, and then computing a
least  squares  estimate  of  the  motion  at  each 
pixel  based  on  the  normal  component  of  motion 
in  a  neighborhood  of  the  pixel. 
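
A least squares estimate of this kind can be sketched with the standard gradient constraint Ix*u + Iy*v + It = 0, which fixes only the component of motion along the intensity gradient at each pixel. The formulation below is an assumed textbook version, not necessarily the estimator of [2]; the array names ix, iy, it (spatial and temporal derivatives over one neighborhood) are illustrative.

```python
import numpy as np


def local_translation(ix, iy, it):
    """Least squares (u, v) from gradient constraints in a neighborhood.

    Each pixel contributes one equation ix*u + iy*v = -it, i.e. a
    constraint on the normal component of motion only.
    """
    a = np.column_stack([ix.ravel(), iy.ravel()])
    b = -it.ravel()
    solution, *_ = np.linalg.lstsq(a, b, rcond=None)
    return solution
```

When all gradients in the neighborhood are parallel the system is rank deficient and only the normal component is recovered; lstsq then returns the minimum norm solution.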

These reduced resolution motion fields are
then organized into an overlapped pyramid by linking
each node at level i to that father at level
i+1 whose motion estimate is closest to its motion
estimate. The next step, which is the most crucial,
segments the pyramid into subpyramids that
cover the original image. The apex of each subpyramid
is, in a sense described below, the most
"reliable" estimate for the motion of the pixels
at the base of that subpyramid (ignoring edge
effects), and its motion properties (i.e., the
pattern of motion estimates in a small neighborhood
of the apex) are used to adjust the flows of
the other nodes in that subpyramid. The apices
of the subpyramids are determined as follows.

A  node,  f,  in  the  pyramid  dominates  one  of  its 
sons,  s,  if 

1)  The  magnitude  of  the  flow  at  s  is  less 
than  the  magnitude  of  the  flow  at  f; 

2)  The  second  derivative,  with  respect  to 
resolution,  of  the  flow  at  f  is  less 
than  the  corresponding  second  derivative 
of  the  flow  at  s;  and 

3)  The  predictive  coding  error  for  the  area 
of  the  picture  around  s  and  f  is  less 
using  the  motion  estimate  at  f  than  it  is 
using  the  motion  estimate  at  s. 
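
The three criteria combine into a single predicate, sketched below. The record type FlowNode and its field names are hypothetical, and comparing the second derivatives by magnitude is an assumption; the predictive coding errors are taken as already computed for the area around s and f.

```python
from dataclasses import dataclass


@dataclass
class FlowNode:
    magnitude: float          # magnitude of the flow at the node
    second_derivative: float  # magnitude of d^2(flow)/d(resolution)^2


def dominates(father: FlowNode, son: FlowNode,
              error_with_father: float, error_with_son: float) -> bool:
    """True if the father f dominates its son s under criteria 1-3."""
    return (son.magnitude < father.magnitude                       # (1)
            and father.second_derivative < son.second_derivative   # (2)
            and error_with_father < error_with_son)                # (3)
```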


Very briefly, the motivations supporting these
criteria are as follows. For (1), if the extent of the
smoothing used to compute the spatial gradient of
the time-varying image is less than the extent
of the motion, then gradient-based techniques tend
to underestimate the image motion. For (2), if
the motion at a pixel is large, then the gradient-based
motion estimates as a function of spatial
smoothing increase (relatively) linearly to the
correct estimate, remain stable for a while, and
then change unpredictably. The second derivative
of flow with respect to the resolution will be
large initially, smallest during the period of stability,
and then large again. For (3), since the
gradient-based motion estimation algorithms assume
that the grey level is invariant under the motion,
the motion field should, in principle, be a perfect
estimator of intensity in subsequent frames.

This  notion  of  dominant  nodes  naturally  leads 
to  a  segmentation  of  the  pyramid  into  subpyramids 
whose  apices  are  the  ends  of  the  longest  chains  of 
dominant  nodes  (starting  from  the  level  above  the 
base). Once the apices have been determined, a
top-down process in each subpyramid adjusts the
motion estimates at all nodes in the subpyramid.
At each level (and starting at the apex) and for
each  node  at  that  level  we  compute  the  divergence 
and  curl  of  the  motion  field  in  a  neighborhood  of 
the  node,  and  then  adjust  the  motion  estimates  of 
the  sons  so  that  the  neighborhoods  of  the  sons 
have  the  same  divergence  and  curl  as  that  of  the 
father.  For  details  see  [2]. 

Figure 3 contains an example of the multiresolution
algorithm applied to the motion sequence
in Figure 1. Since this sequence contains two objects
moving at very different speeds, the motion
estimates that are obtained using a fixed resolution
are not equally reliable for both objects
(the speed of the faster object is consistently
underestimated, and the directions are very unreliable).
Figure 3a shows the original motion estimates
and Figure 3b shows the results using the
algorithm.
Abstract

The  aim  of  this  paper  is  to  show  that  a  wide  variety  of 
perceptual  phenomena  have  succinct  explanations  given 
the  concept  of  frame  primitives.  A  frame  primitive  is  a 
local  coordinate  frame  attached  to  certain  more  primitive 
perceptual data. The idea of assigning coordinate frames to
objects  and  the  role  of  such  frames  in  gestalt  phenomena  is 
appreciated  but  has  not  been  extensively  modeled.  We 
show  that  an  implementation  of  frame  primitives  in  a 
parallel,  connectionist  model  has  special  virtues  in 
understanding many aspects of perceptual dynamics.

1.  Introduction 

This  paper  has  two  objectives:  to  provide  a  general 
model  for  the  problem  of  form  perception  and  to  describe 
the  model  in  terms  of  a  connectionist  formalism. 

Why  Connectionism? 

Computational  models  of  vision  have  stressed  the 
description  of  shapes,  rather  than  the  perception  of  shapes. 
The  first  problem  tends  to  focus  on  the  invertibility  of  the 
representation,  i.e.,  the  reconstruction  of  the  shape’s 
surface  from  the  underlying  description.  The  second 
problem  centers  around  the  computability  of  the 
representation,  particularly  in  the  presence  of  noise  and 
occlusion. 

If  computability  is  stressed,  then  the  choice  of  machine 
architecture  becomes  paramount.  Much  of  form 
perception  has  been  cast  in  terms  of  information 
processing  models,  almost  exclusively  based  on  the  notion 
of computation as that carried out by a sequential Von
Neumann machine [Ballard and Brown, 1982]. However,
this model has many drawbacks as a model of human
perception. Animal brains do not compute like a
conventional computer. Comparatively slow (millisecond)
neural computing elements with complex, parallel
connections form a structure which is dramatically
different from a high speed, predominantly serial machine.
Much of current research in the neurosciences is
concerned with tracing out these connections and with
discovering neural unit responses to complex stimuli.
However, a crucial next step is to characterize neural
function at a higher level than single units. Earlier
connectionist models [Hebb, 1949; McCulloch and Pitts,
1943; Rosenblatt, 1958] were a step in this direction, but at
the  time  those  ideas  were  formed,  the  knowledge  of  the 
brain  was  much  less  than  it  is  now. 


Connectionism  is  the  only  current  model  that  can 
stand  the  crucial  test  of  timing.  That  is,  given  that  entire 
behavioral  responses  can  be  realized  in  100  milliseconds, 
connectionism  seems  to  be  the  only  way  to  construct 
plausible  models  in  terms  of  neural  units  that  can  achieve 
these response times. Previous papers have suggested how
connectionist theories of the brain can be used to produce
testable, detailed models of interesting behaviors
[Feldman, 1981a; Ballard, 1983; Feldman and Ballard,
1982]. These, and work by Hinton [1981a; 1981b] and
Fahlman [1979], have served to shed light on connectionist
architectures, but knowledge of the potential of such
constructs is still in an embryonic stage. By tackling hard
problems  such  as  form  perception  we  hope  to  shed  light 
on  both  form  perception  and  connectionist  models. 

Distributed Computation

Shape  description  has  favored  centralized 
representations.  Examples  of  such  work  are  generalized 
cylinders [Agin and Binford, 1976; Kanade, 1981; Shani,
1981], spherical harmonics [Schudy, 1982], 3-d
generalizations  of  the  medial  axis  transformation  [Badler 
and O’Rourke, 1977], and polyhedral models [Brown,
1981]. Usually these representations are described with
respect to a single, orthogonal frame. The exception is the
curvilinear frame used in generalized cylinders. In
contrast, we are interested in recognizing complex objects
whose representations are decomposable and distributed
among many related coordinate frames. In fact, we adopt a
radical view: most of our representation of shape consists
only of coordinate frames. The approach of a distributed
structural description has been defended in the
psychological literature [Hinton, 1979; Palmer, 1977] and
widely used (see, for example, Marr and Nishihara [1978],
Shapiro et al. [1982], and Brady and Wielinga [1978]).
More specifically, the representation is a set of shape
constraints which, when combined, specify a particular
form.

The  representation  of  a  shape  as  a  set  of  distributed 
constraints  has  several  advantages:  (1)  a  large  number  of 
different  shapes  can  be  represented  compactly,  owing  to 
the  combinatorics  of  the  distributed  constraints;  (2)  the 
shape  can  be  quickly  computed  by  the  parallel  propagation 
of  partial  constraints;  and  (3)  partial  constraints  can  be 
computed independently. In addition, our connectionist
model  meshes  well  with  such  a  representation  since  the 
architectural  structure  prefers  distributed  constraints,  and 
the  computational  method  is  naturally  insensitive  to 
occlusion  and  noise. 
Another advantage of viewframes is that they can be
(at least in principle) computed directly from surface
markings of lines and points and do not depend on
assumptions of continuity and smoothness; thus the
representation avoids the objections raised about intrinsic
images by [Witkin and Tenenbaum, 1983].

The language for expressing the shape model is that of
parameter nets [Ballard, 1983; Feldman and Ballard, 1982].
The distributed, connectionist model consists of networks
of the following entities:

(a) Viewframes. These units represent possible viewer-centered
coordinate frames for perceiving a shape.

(b)  Model-frames.  A  pattern  of  activity  in  these  units 
represents  a  possible  object-centered  frame  that  is 
used  to  describe  a  form. 

(c) View parameters. These units represent values of
scale, rotation, and translation that are appropriate
for specifying the relationship between the viewer-centered
and object-centered frames ((a) and (b)).

(d) Model-identifier nodes. These units represent object
tokens such as "horse," "horse's back," "bicycle,"
etc., in a non-geometric relational description of
object structure.

(e) View-stable features. These units represent shape
parameters that are relatively independent of small
changes in the viewing frame (a). Examples would

[Wbf  hlockii  worfci  ^°‘nt  types  lised  by  KilliaJe 

(f) Focus-of-attention parameters. These units
represent discrete changes of attention between
model-identifier nodes. For example, a focus of
attention (FOA) node might switch attention from
the horse’s back (= back node high confidence) to
the horse’s neck.

(g) Change-of-view parameters. These units provide the
basis for changing the view transformation (c).

(h) Shape parameters. These units specify local shape
information relative to frame parameter units.
Many choices of these parameters are possible
since these are the parameters encountered in
holistic shape representations.

These parameter sets form a basis for describing a model
of human shape perception. The next step is to describe
the constraints between them. The fundamental premise is
that these constraints must be restricted to subsets of
entities. The practical reason for this is one of
combinatorics; if constraints involving many different
kinds of units are allowed, it becomes impossible both to
represent them as connections and stay within the
biologically plausible limitation of 10^4 connections per
unit. However, a key point is that a set of restricted
constraints, when taken together, can imply a larger
constraint. The constraint relations we will need are the
following:

(A)  View-Transformation  (viewframes,  model-frames, 
view  parameters).  This  relationship  is  the 
geometric  transformation  that  relates  the  two 
different  frames  of  reference. 


(B) Relational Constraints (model identifiers, model
frames). These kinds of constraints are those
explored by Shapiro [Shapiro et al., 1982] that
relate model identifiers to ranges of model frame
units.

(C) Characteristic Views (view-stable features, model
identifiers, view parameters). This relationship
relates shape primitives directly to model
identifiers without involving detailed geometry.
For example, if we know we are looking at the side
of a horse, we can expect certain shape features
[Feldman, 1982].

(D) Focus Switching (model-frames, FOA parameters,
change-of-view parameters). This relationship
specifies possible changes of focus. For example,
any frame unit in (b) may become the viewframe
by activating the appropriate change-of-view
parameter.

(E) View-Transformation Switching (view parameters,
change-of-view parameters). This relation handles
the view transform part of (D).

The ensuing sections develop the motivation for these
choices  of  entities  and  relations.  While  this  set  of 
constraints  is  constantly  undergoing  revision,  we  think  it 
provides  a  workable  taxonomy  with  which  to  explore 
interesting  issues  in  shape  perception. 

Outline 

The discussion of these constraints starts from the most
primitive elements; subsequent ideas are presented in
order of increasing complexity. For simplicity, the
examples are limited to two dimensions, but this is not too
serious a limitation. The 3-d versions for some of the
relations have already been developed [Ballard and
Sabbah, 1981; Marr and Nishihara, 1978] and the 2-d
results are applicable to boundary contours, an important
subcase of the general 3-d problem. We first discuss the
concept of frame primitives which we term viewframes.
Viewframes can be extracted from image data by simple
rules combined with relaxation [Zucker, 1980] and Hough
techniques [Ballard, 1983]. An important adjunct to the
viewframe concept is that of space-time processing. Rather
than having separate computations for spatial and
temporal patterns, they are combined in a single
processing network with resultant space savings. Next we
describe the concept of a view transform. The view
transform has been described in computational terms as a
generalized Hough technique and has been put into
connectionist terms by Hinton [1981c]. The view transform
is an important kernel of any computational shape
perception model as it economically relates abstractions of
image data (viewframes) to model frames. Further
economies arise because the view transform has an
important decoupling property. Two of its four
parameters, scale and rotation, can be computed
independently of the other two, x translation and y
translation [Ballard and Sabbah, 1981]. This leads to the
concept of representing the view transform in a
connectionist model as split parameter spaces (parameter
subspaces).
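
The decoupling property can be illustrated for a single pair of corresponding frames. In the sketch below a frame is assumed to be a tuple (x, y, theta, s); the function names are hypothetical. Scale and rotation are obtained from the frames' own scale and orientation without reference to position, and the translation is computed only afterwards.

```python
import math


def scale_and_rotation(view, model):
    """First subspace: uses only the theta and s fields of the frames."""
    _, _, theta_v, s_v = view
    _, _, theta_m, s_m = model
    return s_v / s_m, (theta_v - theta_m) % (2.0 * math.pi)


def translation(view, model, scale, rotation):
    """Second subspace: origin offset once scale and rotation are fixed."""
    xv, yv, _, _ = view
    xm, ym, _, _ = model
    c, s = math.cos(rotation), math.sin(rotation)
    return (xv - scale * (c * xm - s * ym),
            yv - scale * (s * xm + c * ym))
```

In a connectionist setting each pair would raise the activation of one (scale, rotation) unit first, and only the winning values would be used to activate translation units, so the full four-dimensional space never needs to be represented at once.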
The geometric constraints captured by the view
transform are insufficient to relate any subset of image
data to one of a large competing set of stored models. The
rest of this paper explores solutions to this problem. The
simplest is the concept of using the view transform in
special modes which are more constrained than the general
case. We describe top-down mode (looking for a given
object), bottom-up mode (identifying a segmented object),
and tracking mode (looking at a segmented object over
time).

Since the view transformation is underconstrained, a
natural solution is to find additional constraints that are
not geometry-based (e.g., color), and that can be derived
from the image independently. Many possibilities are
discussed in [Ballard, 1983; Feldman, 198?]; we will not
pursue these possibilities here. The most ambitious
solution to the underconstrained nature of the view
transform is to find some hierarchical indexing scheme for
the model shapes so that they do not all have to be
considered at once. Such a scheme requires a sequential
scanning mechanism to move back and forth between
levels of abstraction. We describe such a mechanism in
terms of the formalism and relate it to Yarbus’s scanning
results with human subjects [Yarbus, 1967].

2.  Viewframes 

Representations for geometrical objects are usually
greatly simplified if an appropriate coordinate frame is
chosen. The case is even stronger for articulated objects
[Marr and Nishihara, 1978]. The fact that geometric
representations simplify if appropriate coordinate frames
are chosen agrees well with the many human perceptual
results that suggest frame-dependencies (e.g., [Hinton,
1979]).

Computing good coordinate frames for complex
objects is in general difficult, although some progress is
being made. Brady [1983] shows that the boundary
features of objects suggest logical possibilities. Also, he
argues that rather than a single frame, one should think of
a hierarchy of possibilities, where different levels in the
hierarchy depend on different levels of surface detail and
locality. (Similar points were made in [Marr and Nishihara,
1978], but the emphasis was less on computability.)

While  the  notions  of  coordinate  frame  are  intuitive  for 
segmented  objects,  it  may  be  less  obvious  that  they  are  an 
essential  component  of  the  descriptions  behind  a  variety  of 
gestalt  grouping  phenomena  (figure  1).  Our  model  for  all 
these  phenomena  depends  on  frame  primitives,  which  we 
have  termed  viewframes.  Viewframes  may  be  strongly 
suggested by shape contours (e.g., [Brady, 1983]) or they
may be only weakly suggested by primal tokens. In these
latter cases, viewframes are characterized as being the
essence  of  the  description,  rather  than  an  indexing 
mechanism  for  surrounding  complex  geometry. 

The crucial issues in describing viewframes are how
they are represented and how they are computed. Besides
issues of abstract computability, it is important that
viewframes be computable in terms of connections.
Consider the 2-d case: a complete set of parameters is
specified by the location of an origin x,y, a rotation θ, and
a scale s, and these parameters describe the relation of a
local viewframe to a global image frame. This 4-d space of
parameters covers all possible local frames. To represent
this space, discretize it using some Δx, Δy, Δθ, and Δs, and
assign each resultant discrete cell a value unit. (Ways of
economizing on the number of units will be described in
Section 3.)
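
As a concrete illustration of the discretization, the sketch below maps a frame (x, y, θ, s) to the index of its cell, i.e. the value unit that would represent it. The step sizes, the number of orientation bins, and the linear (rather than logarithmic) quantization of scale are assumptions chosen for the example.

```python
import math


def frame_cell(x, y, theta, s, dx=4.0, dy=4.0, n_theta=16, ds=0.5):
    """Index (i, j, k, l) of the discrete cell containing a frame."""
    dtheta = 2.0 * math.pi / n_theta
    # Wrap the orientation into [0, 2*pi) before quantizing it.
    k = int((theta % (2.0 * math.pi)) / dtheta) % n_theta
    return (int(x // dx), int(y // dy), k, int(s // ds))


def unit_count(nx, ny, n_theta, n_scale):
    """Number of value units needed for an nx * ny * n_theta * n_scale grid."""
    return nx * ny * n_theta * n_scale
```

Even a coarse grid is expensive: 16 x 16 positions, 16 orientations, and 8 scales already require 32768 units, which is why economizing on the number of units matters.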

Frame space can represent all the frames that could be
present (at the resolution level chosen), but only a fraction
of those will usually be present in a given image.
Furthermore, of these, many can be ruled out as being
inconsistent on the basis of local and global frame
grouping rules. Thus

representable  frames  >  possible  image  frames 
>  consistent  image  frames 

In  the  ensuing  paragraphs  we  will  describe  the  process  of 
computing  consistent  image  frames  in  more  detail.  In  the 
process  we  attempt  to  synthesize  a  unifying  explanation 
of our own previous work [Ballard, 1981; Ballard and
Sabbah, 1981] as well as that of others [Stevens, 1981;
Zucker, 1980]. The reason for attempting a synthesis is that
the  frame  description  is  necessary  as  a  primitive  for  all  our 
subsequent  work,  and  that  important  precursors  have 
appeared  that  do  not  explicitly  acknowledge  frames 
[Zucker,  1980]  or  that  (from  our  point  of  view)  miscast  the 
frame  assignment  problem  as  a  correspondence  problem. 

Rules  for  Frame  Suggestion 

We assume that the visual environment has been
tokenized in some way, e.g., collections of points and edges
that may or may not be moving. Out of this primitive
structure, the next useful level of abstraction concerns the
suggestion of frames. The initial suggestion of frames
corresponds to initial levels of activation of units in frame
space. These levels may be raised or lowered depending on
the surrounding context of nearby frames. The rules for
frame suggestion are summarized in figure 2. These rules
are based on a 2-d model.
A curve has a natural local frame which is its tangent.
The tangent specifies a frame origin locus and
orientation locus, leaving scale undetermined. With
this case and the others to follow, two orientations of
the frame are possible, one at θ and the other at θ + π.

Line segments are similar to curves except that there is
a natural choice for scale which is the length of the
line.


[Figure 2, panel f) Correspondence: two points have a natural
constraint of a frame that aligns with the line joining the points.]

Points  suggest  a  frame  origin  but  leave  scale  and 
rotation  undetermined. 

Moving  points  have  a  natural  frame  parallel  to  the 
direction  of  motion.  The  velocity  suggests  a  value  for 
scale. 

Two points have a natural frame whose orientation and
scale are determined by the line joining the points and
whose origin is located at the leftmost (with respect to
the frame) point.

Where the tokens are not the aforementioned
primitives, but rather complex shapes, these rules may still
apply between the local frames of the more complicated
shape tokens. Complex shapes seem to have their own
rules for their local frames [Brady, 1983].
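
Two of the suggestion rules above can be written down directly. In this sketch a frame is assumed to be a tuple (x, y, θ, s) with θ in [0, 2π); the function names are hypothetical, and taking the speed itself as the scale of a moving point's frame is one possible reading of "the velocity suggests a value for scale".

```python
import math

TWO_PI = 2.0 * math.pi


def frames_from_two_points(p, q):
    """Both frames suggested by a point pair (orientations theta, theta + pi).

    Each frame has its origin at the point that is leftmost with
    respect to that frame, and its scale is the distance between them.
    """
    length = math.dist(p, q)
    theta = math.atan2(q[1] - p[1], q[0] - p[0]) % TWO_PI
    return [(p[0], p[1], theta, length),
            (q[0], q[1], (theta + math.pi) % TWO_PI, length)]


def frame_from_moving_point(p, velocity):
    """Frame parallel to the direction of motion; speed suggests scale."""
    speed = math.hypot(velocity[0], velocity[1])
    theta = math.atan2(velocity[1], velocity[0]) % TWO_PI
    return (p[0], p[1], theta, speed)
```

Each returned frame would correspond to an initial activation of one unit in frame space; context from nearby frames then raises or lowers that activation.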

Figure 2. Rules for Frame Assignments.

a) Curves: x, y, θ defined by curve; s undetermined.
b) Line Segments: two possible choices of x, y, θ; s determined.
c) Points: x, y determined; s, θ free.
d) Long Line Segments: long line segments can be viewed as having
   points at their ends; rule for point frames applies.
e) Moving Points: moving points have natural frame parallel to
   direction of motion; curvilinear case is analogous to moving
   along boundary contour; x, y, θ determined; s free.


Frame Relaxation

Frame units have an associated activation level which
ranges between zero and one. An activation level signifies
whether or not the frame unit is part of the current gestalt.
Thus a frame unit which is initially activated by lower
level input may have its activation reduced if it is
inconsistent with neighboring frame units in its surround.
Methods for increasing or decreasing activation have been
previously developed. Zucker [1982] has exactly the right
kind of algorithm for refining frames based on purely local
evidence. In that multilevel relaxation scheme, pairs of
points suggest line segments (rule f) and nearly collinear
line segments can increase each other’s activation level.
(To translate from relaxation labeling to connectionist
relaxation, make a unit for every label and let the
probability of a label be its activation level [Ballard, 1983].)
Similar methods based on correspondence have been used
[Ullman, 1979; Barnard and Thompson, 1979], but
correspondence leads to problems if taken too literally,
since non-correspondences owing to noise and occlusion
have damaging effects.

The best way of dealing with local frame coherence is
to look at the mode of the distribution of local frame
parameters [Stevens, 1981]. This allows frame coherence to
emerge from high levels of ambiguity. Cognoscenti will
recognize this method as a version of the Hough transform
[Ballard, 1983].

Besides local frame coherence, there is also the global
frame coherence found, for example, in Glass patterns. If
identical spotted overlays are rotated a small amount with
respect to each other, a global concentric pattern is seen.

This global pattern can be explained by postulating a
parameter space that explicitly represents parametric
variations in the pattern. Each local frame raises the
activation of units in the parameter space that are
compatible with itself. In this case the global units
represent rotation center coordinates and the local frame
raises the activation level of (metaphorically: votes for)
parameter units on a linear locus perpendicular to the x
axis of the frame (see Figure 3). Note that the Hough
transform model for computing the rotation center
provides a mechanism for selecting the mode of the
activated units. The many spurious activations that arise
from suggested frames that are not part of the global
pattern are spread among very disparate units, and thus
can be discounted via local inhibition. Other methods that
describe this construction (e.g., [Hildreth, 1983]) do not
acknowledge the above problem in applying it.
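
The voting scheme described above can be illustrated with a
small accumulator sketch. It is not the connectionist network
itself: the integer grid of candidate rotation centers, the
tolerance `tol`, and the (x, y, theta) frame encoding (theta
being the direction of the frame's x axis) are all assumptions
made for illustration. Each frame votes for every grid cell
lying on the line through its position perpendicular to its x
axis, and the mode of the accumulator is read off as the
rotation center.

```python
import math
from collections import Counter


def vote_rotation_center(frames, grid=range(-10, 11), tol=0.5):
    """Accumulate votes for candidate rotation centers (a, b).

    frames: iterable of (x, y, theta); theta is the frame's x-axis direction.
    grid:   hypothetical set of integer coordinates used for both a and b.
    tol:    hypothetical half-width of the voting locus, in grid units.
    """
    votes = Counter()
    for x, y, theta in frames:
        ux, uy = math.cos(theta), math.sin(theta)  # unit vector along frame x axis
        for a in grid:
            for b in grid:
                # (a, b) lies on the perpendicular locus when the offset
                # from the frame position has no component along the x axis.
                if abs((a - x) * ux + (b - y) * uy) <= tol:
                    votes[(a, b)] += 1
    return votes


def rotation_center(frames):
    """Return the accumulator mode, i.e. the most-voted candidate center."""
    return vote_rotation_center(frames).most_common(1)[0][0]
```

Spurious frames add isolated votes scattered over the grid, so
they rarely change the mode; this is the accumulator analogue
of discounting them by local inhibition.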

A viewframe is an abstraction that can arise from more
primitive stimuli which may be either spatial (e.g., rule I)
or temporal (rule e). Thus the problem of recognizing
patterns in collections of frames can be cast as one of
abstract geometry independently of whether the patterns
arise from spatial or temporal data. An example should
make this clear: consider radial lines emanating from a
common point (Figure 4). These may arise from the
common point of projected parallel lines, a vanishing
point, or from the common focus of expansion of optic
flow due to a translating observer. These spatial and
temporal phenomena are closely related; the loci of points
translating with respect to a rectilinearly moving observer
are also parallel lines.


Figure 4.


The global transformation to detect radial lines can be
carried out in two steps. The first detects the colinearity of
oriented frames and the second detects the pattern, in line
parameter space, which is due to the radial field. In the
notation for parameter transforms [Ballard, 1983], we can
describe the line transform as

$$\langle (x, y, \Delta x, \Delta y),\ (r, \theta),\ (\theta = \tan^{-1}(\Delta y / \Delta x);\ r = x\cos\theta + y\sin\theta) \rangle$$

The parameters Δx and Δy describe the direction of the
frame vectors at a location x, y. The notation <( ),( ),( )>
means that the units described by the parameters in the
first set of parentheses will raise the activation levels of a
subset of those in the second set of parentheses. The
relationships in the third set of parentheses describe the
subset.

Radial lines map into circles in (r, θ) parameter space
and these can be detected by

$$\langle (r, \theta),\ (a, b),\ (r/2 = a\cos\theta + b\sin\theta) \rangle$$

Note that at this point, whether (x, y, Δx, Δy) arises from
intensity gradients or flow vectors has been left
indeterminate. Of course, in order to use the answer
effectively, one must know whether the computations are
relevant to space or time.
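
The two transforms can be checked numerically with a short
sketch. The function names are hypothetical, and two
assumptions are made that the text does not state: the frame
vector (Δx, Δy) is taken to be normal to the line (as an
intensity gradient would be), so that r is constant along the
line, and `atan2` replaces the inverse tangent of Δy/Δx to
avoid dividing by zero. Under those assumptions, every line
through a common point (px, py) satisfies the circle
constraint with (a, b) = (px/2, py/2).

```python
import math


def line_params(x, y, dx, dy):
    """First transform: frame (x, y, dx, dy) -> line parameters (r, theta).

    Assumes (dx, dy) is normal to the line through (x, y).
    """
    theta = math.atan2(dy, dx)
    r = x * math.cos(theta) + y * math.sin(theta)
    return r, theta


def circle_residual(r, theta, a, b):
    """Second transform's constraint: zero when r/2 = a cos(theta) + b sin(theta)."""
    return r / 2 - (a * math.cos(theta) + b * math.sin(theta))


def radial_frames(px, py, n=8, dist=3.0):
    """Hypothetical test data: n frames on radial lines through (px, py)."""
    frames = []
    for k in range(n):
        phi = 2 * math.pi * k / n  # direction of the k-th radial line
        x, y = px + dist * math.cos(phi), py + dist * math.sin(phi)
        frames.append((x, y, -math.sin(phi), math.cos(phi)))  # normal to the line
    return frames
```

A frame whose line misses the common point gives a nonzero
residual, so only frames belonging to the radial field
reinforce the same (a, b) unit.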

Some of the results that can be computed from a
frame processor are valid for both space and time, and
others are only valid for either one or the other dimension.
For example, the distance of closest approach, Q (given by
Eq. 7.1.3 in [Ballard and Brown, 1982]),

$$Q^2 = (\mathbf{x} \cdot \mathbf{x}) - \frac{(\mathbf{x} \cdot \mathbf{o})^2}{\mathbf{o} \cdot \mathbf{o}}$$

where

$$\mathbf{o} = (a, b, 1) \quad \text{and} \quad \mathbf{x} = \left( (f - z)x/f,\ (f - z)y/f,\ z \right)$$

is valid for both space and time, but "time-to-adjacency,"
given by the transform

$$\langle (a, b, x, y, \Delta x, \Delta y),\ (t),\ (t = d / \|\mathbf{v}\|) \rangle$$

where

$$\|\mathbf{v}\| = \sqrt{\Delta x^2 + \Delta y^2}$$

and

$$d = \sqrt{(x - a)^2 + (y - b)^2}$$

is only valid for the temporal interpretation.
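
As a worked illustration, both quantities can be evaluated
directly. This is a minimal sketch with hypothetical function
names; the point x and the direction o are passed in as plain
3-vectors, which sidesteps the projection geometry, and a zero
flow vector is not handled.

```python
import math


def closest_approach_sq(x, o):
    """Squared distance of closest approach: (x.x) - (x.o)^2 / (o.o).

    x and o are 3-vectors given as sequences of floats; o must be nonzero.
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    return dot(x, x) - dot(x, o) ** 2 / dot(o, o)


def time_to_adjacency(a, b, x, y, dx, dy):
    """t = d / ||v||: distance from (x, y) to (a, b) over the flow speed."""
    speed = math.hypot(dx, dy)   # ||v||, must be nonzero
    d = math.hypot(x - a, y - b)
    return d / speed
```

The first function uses only geometry, which is why it carries
over to either interpretation; the second divides a distance by
a speed, so it is meaningful only when (Δx, Δy) is a flow
vector.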

As another example, consider the detection of spiral
patterns. Since these can be perceived on the order of 100
ms, like Glass patterns, our hypothesis is that the
perceptual mechanism must be manifested as connections
(see also [Feldman and Ballard, 1982]). However, spiral
patterns that arise as strictly spatial patterns in nature,
while possible, are rather infrequent. In contrast, spiral
patterns derived from temporal loci are frequent
experiences. For example, as discussed earlier, the optic
flow due to an observer translating is radial. This flow,
summed with the concentric flow produced by a rotation
about the direction of travel, leads to spiral temporal
patterns. Note that the flow is present over the full visual
field, even for a short temporal duration. Thus the frame
processor architecture implies that the ability to recognize
infrequent spatial spirals is a direct consequence of the
ubiquitous nature of temporal spirals that have the same
underlying geometry.

Weak corroborative evidence for such transfer comes
from the Fraser illusion (Figure 5). In this illusion, spirals
are seen even though the global patterns are concentric
circles. Presumably this is because the combined local
evidence predicts spirals. This is precisely the effect one
would expect from temporal spirals, as the predominant
experience would occur in spatio-temporally local
segments.


Figure 5.

The principal virtue of having a frame processor is that
its circuitry, together with that necessary to distinguish
between space and time, is much less than that required to
implement two independent processors, one for space and
one for time.

One of the problems that might occur with a single
frame processor is that separate spatial and temporal
events could be confused. This may indeed happen. For
example, in the Pulfrich pendulum illusion, a point of light
oscillating in the frontal plane is viewed with two lenses,
one of which has been darkened. In this illusion, the
darker input is interpreted as a temporal delay, and the
point is seen to move in depth. For most cases, however,
the modes of use of the frame processors can be separated
into different spatio-temporal regimes which do not
overlap.

4. Computing the Viewing Transformation

An object's viewframes may be related to its internal
frame representation by a viewing transformation. Knowing
any two of {the internal representation, the viewing
transformation, and the viewframes} allows the third to be
computed. Usually the viewer-centered data are known
but both the corresponding internal shape and viewing
transformation must be computed. This problem is
generally underdetermined [Palmer, 1981]. Furthermore,
the image data are usually cluttered with many features that
belong to different objects, and these tend to confound the
perception of a particular shape. Previous work [Ballard,
1981; Ballard and Sabbah, 1981; Sloan and Ballard, 1980]
made the simplifying assumption that the internal
representation contains only a single object. In this case
the viewing transformation could be computed and parts
of the object in the image identified despite other image
clutter. The task of determining if a known object is in an
image is posed as: is there a transformation of a subset of
image features such that the transformed subset can be
explained as the object? If the answer to this question is
no, then the object is not present. If yes, then the
transformation provides all the necessary information
about the object. In a connectionist network, the
affirmative answer is represented by the convergence of a
view transform network to a single active unit.

The viewing transformation is completely specified in
the 2-d case by four parameters: two for translation, one
for orientation, and one for scale. Ballard and Sabbah
[1981] have shown that it is possible to decouple the
interdependence of two subgroups of the parameters for
scale and orientation from translation. In other words, the
orientation and scale of the object can be detected without
knowing its translation. In fact, there is a natural
precedence of parameters:

scale > orientation > translation

This precedence stems from different factors. The reason
scale is simple to detect is that it is available from intrinsic
image data [Barrow and Tenenbaum, 1978]. An intrinsic
image is an image of some important parameter that is
retinotopic; that is, in registration with the intensity data
on the viewer's retina. For scale computations the most
important image is the depth map [Marr and Nishihara,
1978]. A depth map represents distances with respect to
the viewer. Thus if the internal representations have an
associated absolute metric, the scale between an
object-centered feature and a viewer-centered feature can be
immediately determined. Orientation is easier to detect
than translation as it is functionally independent from it,
whereas the reverse is not true. In other words,
viewer-object orientation correspondences can be computed
without considering translation, but to do the same for
translation correspondences, one must know the
appropriate values of orientation and scale.

Representing the View Transform with Connections

The connections for the view parameters may be
viewed as a form of Hough transform using constraint
tables [Ballard, 1981]. Matches between image frames and
object frames constrain the values for the viewing
transform parameters. Each image frame maps into only a
set of allowable parameter values. When all the frame
matches are taken into account, the mapping is many-to-one
onto plausible parameter values. (Since the cost of this
method is exponential in the number of parameters
considered together, the decoupling of parameters into
groups of scale, orientation, and translation mentioned
above is very significant [Ballard, 1983], and we will pursue
this in a moment.)

In this paper we will restrict our examples to two
dimensions, although the constraints for the 3-d case are
only slightly harder [Ballard and Sabbah, 1981]. Consider a
2-d primitive specified by a single x axis vector x defined
in the viewer frame. Suppose it corresponds to a vector y
in the object frame. The transformation between x and y is
specified by view parameter p where p = (o, Δx, s). These
parameters correspond to orientation, translation, and
scale, respectively. In the parameter network a particular
unit p will receive conjunctive connections from
appropriate pairs of units x and y where

$$\Delta\mathbf{x} = \mathbf{x} - \mathbf{y}$$

$$s = \|\mathbf{x}\| / \|\mathbf{y}\|$$

$$o = \mathrm{angle}(\mathbf{x}) - \mathrm{angle}(\mathbf{y})$$

where "angle" is the angle between the vector and the x
axis in the appropriate frame.

The foregoing description specifies approximately one
third of the connections implied by the view transform
relation (A): view parameter units receive connections
from model units and frame units. However, as Hinton
[1981c] has pointed out, the role of the entities in relation
(A) is symmetric, and each may receive connections from
the other two. For example, a unit x will receive
conjunctive connections from units y and p where

$$\mathbf{x} = \mathbf{y} + \Delta\mathbf{x}$$

$$\|\mathbf{x}\| = s\,\|\mathbf{y}\|$$

$$\mathrm{angle}(\mathbf{x}) = o + \mathrm{angle}(\mathbf{y})$$

The conjunctive connections are used to specify that
several parameters logically need to be present
simultaneously in order to affect the behavior of a unit.
Where the connections between networks are symmetric,
we will use Hinton's notation of a small triangle at the
junction of connecting lines (see Figure 6).


Figure 6.
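
The symmetric relations among x, y, and p given earlier in
this section can be written out as a pair of functions. This is
a literal transcription under one stated assumption: a frame is
represented as a position plus a length and an angle, so that
"x - y" is taken on positions while the scale and orientation
use the vector's length and angle. The `Frame` and
`ViewParams` types are hypothetical names introduced for
illustration.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    px: float      # position, x component
    py: float      # position, y component
    length: float  # length of the frame's x-axis vector
    angle: float   # angle of that vector to the x axis, in radians


class ViewParams(NamedTuple):
    o: float   # orientation
    dx: float  # translation, x component
    dy: float  # translation, y component
    s: float   # scale


def params_from_frames(x: Frame, y: Frame) -> ViewParams:
    """View parameters implied by matching image frame x to model frame y."""
    return ViewParams(o=x.angle - y.angle,
                      dx=x.px - y.px,
                      dy=x.py - y.py,
                      s=x.length / y.length)


def image_from_model(y: Frame, p: ViewParams) -> Frame:
    """Inverse direction: image frame implied by model frame y and parameters p."""
    return Frame(y.px + p.dx, y.py + p.dy, p.s * y.length, p.o + y.angle)
```

Because either function can be recovered from the other, any
two of the three entities determine the third, which is the
symmetry that the triangle notation expresses.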

Confidence  Updating  Strategy 

Given the input connections to a unit, which may be
conjunctive, one still has to specify how to update the
confidence of the unit. There are many reasons for not
using a strict summation [Feldman and Ballard, 1982]. We
use a "normalized maximum of sums over threshold"
formula. Let the inputs to a unit be organized into sets of
conjunctive connections {S_i} and let each such set have a
threshold θ_i. Let C_i be the sum of the confidences of the
units in S_i. Then the confidence of a unit j is updated by:

$$C_j = \frac{\max_i \{ C_i - \theta_i \}}{\max_j \{ C_j \}}$$

The behavior of the numerator is easy to accommodate in
the behavior of a unit, but the denominator requires a
separate network, such as those described in [Feldman and
Ballard, 1982].
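
One possible reading of the "normalized maximum of sums over
threshold" rule is sketched below. It is an interpretation, not
the authors' implementation: the numerator is computed per unit
as the largest thresholded sum over its conjunctive sets,
negative values are clamped to zero (an added assumption), and
the denominator is taken to be the largest such value among the
competing units. The data layout and function names are
hypothetical.

```python
def raw_confidence(conjunctive_sets, thresholds, confidence):
    """Maximum over conjunctive sets of (sum of member confidences - threshold)."""
    return max(sum(confidence[u] for u in members) - theta
               for members, theta in zip(conjunctive_sets, thresholds))


def update_confidences(inputs, confidence):
    """Update every unit, then normalize by the largest raw value.

    inputs:     {unit: (list of conjunctive sets, list of thresholds)}
    confidence: {input unit: current confidence}
    """
    # Clamping at zero is an assumption; it keeps confidences non-negative.
    raw = {j: max(0.0, raw_confidence(sets, thetas, confidence))
           for j, (sets, thetas) in inputs.items()}
    top = max(raw.values())
    return {j: (v / top if top > 0 else 0.0) for j, v in raw.items()}
```

In this sketch the division is done centrally, which stands in
for the separate normalizing network that the text says the
denominator requires.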


Split  Parameter  Spaces 

The connectionist implementation of the view
transform computations up to this point has utilized the
constraints developed in [Ballard and Sabbah, 1981] and is
a variant of Hinton's letter recognition model [Hinton,
1981c]. However, this approach requires too many units.
Consider the 2-d case. If we allow 100 values for each of
scale, orientation, and horizontal and vertical translation,
each network in the view transform requires 100^4 units.
More problematic is the approximately 100^4 conjunctive
connections per unit, which is totally unrealistic. Hinton
has suggested reducing the units by using units with
overlapping parameter values [Hinton, 1981b]. This
concept reduces the number of units by a factor of 1/D^(k-1),
where D is the diameter of the unit and k the dimension
of the parameter vector. While this is a dramatic reduction
and biologically plausible, it may still not reduce the
number of units enough, and it places an added burden on
the number of conjunctive connections required. A
complementary way of reducing the number of units
required is to use split spaces. Split spaces is the concept of
representing a high dimensional set of units with subsets
of units of lower dimensionality. For example, the 4-d
model frame net can be represented as two networks, one
with location units (x, y) and one with length and
orientation units (l, θ). Split spaces introduce the possibility
of erroneously associating parameter subsets, but the
probability of a false association can be made extremely
small with the assumption that the space is in some sense
"sparse" (i.e., the number of units active at any one time is
not too large).
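
The arithmetic behind the saving is easy to reproduce. The
sketch below uses hypothetical function names and counts units
only; it also shows why sparseness matters, since with m frames
active at once each subspace holds up to m active units and the
split representation alone cannot say which of the m * m
pairings are the true ones.

```python
def units_full(values_per_param: int, k: int) -> int:
    """Units needed for one k-dimensional space."""
    return values_per_param ** k


def units_split(values_per_param: int, dims) -> int:
    """Units needed when the space is split into subspaces of the given dimensions."""
    return sum(values_per_param ** d for d in dims)


def candidate_pairings(m: int) -> int:
    """Pairings of active units across two subspaces when m frames are active."""
    return m * m
```

For the 2-d case in the text, `units_full(100, 4)` is 100^4,
whereas splitting into (x, y) and (l, θ) gives
`units_split(100, (2, 2))`, which is 2 * 100^2.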

Figure 7 shows the split space representation of the 2-d
viewer transformation computation. The key point is that
the computation be made sequential to the degree
required by dependencies. Thus, translation parameters are
not computed until the scale and rotation parameters have
been found. The process is dynamic in that these latter
parameters can in turn indirectly affect the previous ones.
The network contains the following three groups of
constraints:

(1) The connections for scale and orientation. As
    before, one can use conjunctive connections to
    determine the relationship between model frame
    length and orientation, image frame length and
    orientation, and the corresponding view
    parameters. The difference is that these units
    represent only length and orientation parameters
    and do not involve translation.

(2) The connections for rotated model parameters. The
    key to this implementation of split spaces is the use
    of rotated model frame units x' related to x by

    $$\mathbf{x}' = s\,\mathrm{ROT}(o)(\mathbf{x})$$

    where ROT(o) is the appropriate rotation matrix.

(3) The connections for translation parameters. Since
    rotated model units can differ from image units by
    at most a translation, the third set of connections
    between units is determined by the equation

    $$\mathbf{x}' = \mathbf{x} + \Delta\mathbf{x}$$


Figure 7.
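
The three groups of constraints can be mimicked by a
sequential voting procedure. The sketch below is an
accumulator analogue rather than the network of Figure 7, and
it makes several assumptions: frames are tuples
(px, py, length, angle), the bin sizes are hypothetical, every
model frame is paired with every image frame, and the winning
bin stands in for network convergence. To avoid the symbol
clash in the text, the code speaks of model, rotated model, and
image frames.

```python
import math
from collections import Counter


def best_scale_orientation(model, image, s_step=0.1, o_step=math.radians(5)):
    """Group (1): vote on (s, o) using only lengths and angles."""
    votes = Counter()
    for _, _, m_len, m_ang in model:
        for _, _, i_len, i_ang in image:
            s_bin = round((i_len / m_len) / s_step)
            o_bin = round(((i_ang - m_ang) % (2 * math.pi)) / o_step)
            votes[(s_bin, o_bin)] += 1
    (s_bin, o_bin), _ = votes.most_common(1)[0]
    return s_bin * s_step, o_bin * o_step


def best_translation(model, image, s, o, step=0.5):
    """Groups (2) and (3): rotate and scale the model, then vote on translation."""
    c, sn = math.cos(o), math.sin(o)
    votes = Counter()
    for mx, my, _, _ in model:
        rx, ry = s * (c * mx - sn * my), s * (sn * mx + c * my)  # s ROT(o)
        for ix, iy, _, _ in image:
            votes[(round((ix - rx) / step), round((iy - ry) / step))] += 1
    (bx, by), _ = votes.most_common(1)[0]
    return bx * step, by * step
```

Translation is voted on only after scale and orientation have
been fixed, so each accumulator is two-dimensional instead of
one four-dimensional space; orientation differences close to a
full turn would need wrap-around handling that this sketch
omits.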

Modes 

In general the network of Figure 7 is not adequate to
match any subset of image frames with a subset of model
frames, since this problem is underdetermined. In
connectionist terms, this means that if sets of model frames
for many different prototypes are active, together with
many different image frames, the view transform network
will not converge. There are, however, many restricted
cases where convergence is possible. These restricted cases
would arise from connecting the view transformation
kernel to auxiliary networks such as those described
earlier. We can partition these cases into three important
categories: top-down mode, bottom-up mode, and tracking
mode. In top-down mode, a single object is being sought,
and thus the model frame space contains only active units
for that prototype. In this case the view transform network
will converge under high noise levels (= many active
non-object units) in the image frame space. In experiments
[Ballard and Sabbah, 1981], up to four to five times the
number of frame units could be activated before the view
transform network would converge to a false mapping. In
bottom-up mode, a set of frames corresponding to a single
unit has been segmented, and the problem is to map that
set onto one of a collection of simultaneously active
prototypes. Experiments in this mode have not been done,
but the symmetry of this mode with top-down mode
suggests that similar results would apply. This does not
mean that five prototypes could be tested for at once;
rather, the union of the number of frames in all of the
prototypes should not exceed four to five times the
number of frames in any one prototype. Obviously a huge
number of prototypes could be tested for simultaneously.
Where N is the number of frames, the number of
prototypes is (4^ ). The third mode of the view transform
network is that of tracking. In this mode segmented image
frames are "transferred" to the model frame space at an
acquisition time. This is done by priming the
transformation network with the identity transform.
Thereafter the view transform records the transformation
that the image frame data undergo. Tracking mode is
somewhat different from the other two modes in that the
model prototype does not have to be known a priori;
instead, at some acquisition time, the image frames can be
used as model frames. If this prototype is to be
remembered, then some mechanism must handle this. One
good possibility is that of recruitment [Feldman, 1981b].

Tracking mode may also be used to recognize spatial
regularities. Suppose we are given the pattern of Figure 8.
Further suppose that by some fixation technique one
element of this pattern can be "loaded" into the model
frame space. The connection patterns will naturally
compute all the view transforms between the model frame
units and the other instances of the image frame units. In
the case of the example, Figure 8, a linear pattern of active
units will be seen in view transform space. This linear
pattern explicitly captures the main part of the notion of
regularity seen in the original pattern.


Figure 8.

5. Focus of Attention

In this section we argue two points: (1) that shape
representations are hierarchical [Marr, 1982]; and (2) that
levels of a hierarchy are necessarily examined sequentially.

The  Need  for  Hierarchical  Descriptions 

Hierarchical shape descriptions were eloquently
defended by Marr and Nishihara [1978], and our notion of
such is essentially captured in their work. We do not insist
on generalized cone primitives as they did, but require that
whatever representations are used have a geometrical basis
and defining coordinate frame. Thus, the crucial part of
the representation is the hierarchical organization of
different coordinate frames. The part descriptions with
respect to those frames could take many forms, e.g.,
polyhedra or splines.

There are several reasons why frame hierarchies are
important. First, complex objects are more simply
represented by pieces described with respect to different
coordinate frames than with a complicated description that
uses a single frame. Second, the hierarchical organization
of these frames allows for ease of indexing and the explicit
representation of intermediate hypotheses. In a
computational scheme for accessing the details of the parts
through their more general features, hierarchical
organization allows for strategies that take steps
proportional to the logarithm of the number of parts. A
final reason for hierarchies is that their varying "grains"
form a logical description of the different viewing
conditions encountered by the perceiver. When an object
is distant, only gross features will be apparent owing to
limitations of the imaging optics (as well as others in
post-processing). A proximal object may reveal details but may
be so close that it cannot be imaged in a single view. In
these cases the identification problem is simplified if the
shape representation itself is also segmented in terms of
the resolution of its parts.

Shape hierarchies can be defined relative to different
complexity measures. These are closely tied to the type of
parameterization allowed, including the aforementioned
"grains" (shape distortion), connectivity relations, and the
articulation freedom allowed in the parts of an object.
However, there are strong dependencies between these
factors when one considers how recognition proceeds. For
example, in perceiving a horse, it is necessary to determine
exactly where its neck is relative to its body before
considering its detailed shape. In the following discussion
we will be concerned mainly with the representation of
articulation in objects.

The Frame Hierarchy in Connectionist Nets

The above points speak of the necessity of hierarchies
but are neutral with respect to their implementation in
hardware. To describe the implementation of shape
hierarchies in nets, we turn first to the structure of the
model identifier net.

The model identifier net receives connections from the
model frame net and also connects to it. The purpose of
such connections is to implement the relation (B)
described earlier. The units in the model frame are in a
canonical form; that is, they are independent of the view
(that variation is captured by view parameter units and the
view transform). The canonical form greatly simplifies the
implementation of relation (A) since, for example, a
horse's back frame will always correspond to the same
model unit.

In general, the connections between units in these two
representations will not be one-to-one. The reason is that
the model identifier net has a relational character; a
horse's neck unit in that frame corresponds to several
possible units in the model frame net, owing to the fact
that the neck can move relative to some fixed part in the
body frame (which we choose to be the back). Figure 9
illustrates this. A second point is that units in the model
frame net also receive connections from the image and
view parameter networks. The intersection of units
receiving excitation will help specify the appropriate
corresponding model identifier units.

We use conjunctive connections from the frame nodes
in the model identifier nets to appropriate model frame
units. Thus, a model frame unit may become active only if
it is receiving image input, and is in the correct "context."


[Figure 9 labels: units connected to horse's neck unit (above);
example of a particular unit receiving inputs from image frame
and view transform; neck portion of model frame net]

Figure 9: A partial example of the model identifier net for
a horse, and the model for one of its subparts.

Frame Switching

The most important point of the previous constraint in
terms of computational complexity is that although all
identifier units are connected to model frame units, only a
small portion are active at any one time. To see the
importance of this decision, consider an alternative: a
separate model frame net for each identifier unit. This
would allow all possible shapes to be processed in parallel
but would require an unrealistic number of units. Our
principal hypothesis is that there is limited hardware to
compute the view transform (here we allow only one piece
of hardware) and the meaning of the active units therein is
determined by the active model identifier units connected
to it.

Given that only a small set of model identifier units
can be active at any one time, one needs a mechanism for
switching between sets of units at different hierarchical
levels. An example of such a mechanism is diagrammed in
Figure 10. The first problem is to select a different frame
to examine. This is accomplished by putting the model
frame net into "Winner-Take-All" mode. (For a discussion
of WTA nets the reader is referred to Feldman and
Ballard's work [Feldman and Ballard, 1982].) Selecting a
single unit from the model frame net has three effects:

(1) it enables a frame switch unit;

(2) it deactivates all but a subset of the currently active
    model identifier units;

(3) it allows the appropriate view transform change to
    take place.

Referring to Figure 10, the frame switch unit connects to
all particular frame switch units, but since this is a
conjunctive connection with model identifier units, only an
appropriate subset of particular frame switch units will be
excited. An activated frame switch unit in turn activates its
corresponding frame unit, and deactivates the ancestor
frame. The newly activated frame unit will then excite the
model identifier unit members of its frame. We have also
designed a transform switch mechanism which proceeds in
lock-step with this, and allows image data coupled with
current hypotheses to compute the next transform.
However, due to lack of space, we omit its presentation
here.
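
The selection step and its second effect can be caricatured in
a few lines. This is a deliberately simplified sketch: the
winner-take-all step is a single global maximum rather than a
network settling over time, the frame membership table is a
hypothetical data structure, and the unit names are invented
for illustration. The transform switch is not modeled.

```python
def winner_take_all(activation):
    """Keep only the most active unit; every other unit is set to zero."""
    top = max(activation.values())
    return {u: (a if a == top else 0.0) for u, a in activation.items()}


def select_frame(activation):
    """Name of the winning model frame unit."""
    return max(activation, key=activation.get)


def switch_frame(selected, members, identifier_activation):
    """Deactivate all model identifier units outside the selected frame.

    members: {frame name: set of model identifier units belonging to it}
    """
    keep = members[selected]
    return {u: (a if u in keep else 0.0)
            for u, a in identifier_activation.items()}
```

Only the identifier units that are members of the winning frame
stay active after the switch, which is what lets a single view
transform network be reused at each level of the hierarchy.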

A particular advantage of this design is that evidence
for a more abstract unit can accumulate even though a less
abstract frame is active. Even when horse's back is not in
the current frame, its activation can be increased indirectly
through ancestor connections from an active horse's ear
frame.


Figure 10: An example of the frame switching mechanism,
in which attention switches from horse's body to neck.
Connections to model units are not shown. The link with a
circle at its tip represents an inhibitory connection.

Some  Motivation  for  Frame  Switching 

The foregoing discussion developed the motivation for
frame switching from representational grounds. In a
connectionist network, the representations are necessarily
distributed into pieces because of bounds on the number
of connections per unit [Feldman and Ballard, 1982].
There are at least two additional arguments, however, for
frame switching. One is that the mechanism can resolve
ambiguities which arise from split-space representations
(Section 4). The other argument is derived from
psychological tests.

A problem with any split-space representation is that
one cannot keep track of the correspondences between
units in each subspace. This problem is shown by Figure
11, where the model frames are different from the image
frames, but this difference does not show up in any of the
subspace computations. One solution is to have a
hierarchical representation of the shape. In the example,
the primitives would be represented once as separate units
and again as units which are part of a global frame. If the
object is rigid, one can switch the focus to a single unit
while changing the viewing transform in a predictable way.
The fact that this cannot be done with the example in
Figure 8 is the mechanism for detecting the mismatch.

A second indication of the importance of frame
switching is due to a clever construction by Pavel (Figure
11). In this figure three tokens are moved along linear
loci. If the tokens are rotationally symmetrical (series a),
the overall perception is that of a moving triangle, but if
some pronounced asymmetry is given the tokens (series b),
the perception immediately switches to that of three
independent translations. The explanation for this
perception in terms of our connectionist model uses a
collection of the mechanisms already suggested. First, the
tokens are analyzed by putting the view transform in
tracking mode. The rotationally symmetric tokens can
suggest frames (rule t) that excite a single set of view units.
Even if frame switching is used, the view transform is still
supported, owing to the rotational degree of freedom in a
single token. In the second series, however, the situation is
very different. If the tokens have a pronounced
asymmetry, the corresponding frame units will
predominate, and their loci are incompatible with a single
set of view transform units.




Figure 11.

One feature of our frame switching model is that it is
necessarily discrete. Focus of attention switches between
individual frame units that have discrete separations. Some
support for this model can be derived from the classic
experiments in eye movement tracking [Yarbus, 1967].
Subjects examining pictures typically used saccadic eye
movements that varied as a function of task required but
more importantly:

subjects appear to have a fixed hierarchy of interest
protocols, e.g., humans > animate > inanimate, etc.

subjects appear to have fixed scanning protocols;
subjects examining the same picture on
different days would exhibit essentially similar eye
movement patterns.

6.  Summary  and  Conclusions 

Figure 12 summarizes the various constraints
employed by our connectionist model. In this figure we
have abandoned the triangle notation and only indicate the
relations between nets without describing the detailed
nature of the connections. This is remedied by earlier
descriptions and figures.

The validity of the view transform approach has been
tested in non-connectionist algorithms [Ballard, 1981;
Ballard and Sabbah, 1981; Sloan and Ballard, 1980].
Currently the connectionist version of relation (A) and a
frame switching mechanism similar to the one described
have been successfully implemented; other networks are
being programmed.


A difficult problem for any form perception model is
to explain the perception of "fruit face" [Palmer, 1975], as
shown by Figure 13. (Many similar examples can be
constructed.) Detailed experimentation with our model
will be necessary to determine whether it can exhibit
appropriate oscillatory behavior between, e.g., seeing a
cherry as a cherry and seeing a cherry as an eye. However,
the representation allows for such behavior. Figure 13
shows the two states of the nets corresponding to the two
different perceptions. Our use of frames is similar to
Hayes' [1978], and satisfies Hinton's notion of a system
that uses the same image features but assigns them
different "roles" [Hinton, 1981c]. In seeing the cherry as an
eye, the face frame is active and the geometric frame
features for the cherry are part of active connections to the
eye unit via the view transform. In seeing the cherry as a
cherry, both the cherry frame and the cherry unit in the
model identifier net are active.

Figure 13: Plausible alternatives for "fruit face."

Many refinements of the basic approach are currently
being studied. Feldman (1982) is developing a
computational basis for representing objects in space in
terms of four coordinate systems. This work meshes with
our own in that more immediate (foveal) and more
abstract (environment) frames are described as well as
frames similar to our image and model frames. An
incorporation of a foveal/eye movement mechanism would
be of immediate advantage to the current system. By
adjusting viewing parameters (e.g., centering the view on
the current frame) one could minimize the units that have
to be represented in the view transform.

One issue that we have sidestepped is that of the most
abstract control. What triggers a frame switch? Many
possibilities exist, e.g., breadth first scanning in model
identifier space, the same in model-frame space, a
"pre-wired" pattern of checking, as well as others, but the
problem is still open. However, as Posner [1978] suggests,
our job may be to give the homunculus less and less to do;
hence our confidence in the present system.

Abstract


This  paper  develops  a  simple  and  robust  procedure  for 
recovering  sensor  motion  parameters  from  image 
sequences  induced  by  unconstrained  sensor  motion 
relative  to  a  stationary  environment.  Difference 
vectors  of  optic  flows  approximate  the  orientations  of 
the  translational  field  lines  in  image  areas  in  which 
there  is  depth  variance  between  the  corresponding 
environmental  points  and  sufficient  angular  separation 
from the translational axis. This is developed into a
procedure consisting of four steps: 1) locally computing
difference vectors from an optic flow field; 2)
thresholding the difference vectors; 3) minimizing the
angles between the difference vector field and a set of
radial field lines which correspond to a particular
translational axis; and 4) extracting the translational and
rotational component fields given the translational axis.
This procedure does not require a priori knowledge
about sensor motion or structure of the scene. It
depends critically on sufficient variation in depth along
some visual directions to endow the flow field with
discontinuities. We present results of applying the
procedure to sparse and low resolution displacement
fields.


Introduction 

The  motion  of  an  observer/sensor  is  in  general 
composed  of  a  translation  and  a  rotation.  It  generates 
an  optic  flow  field  in  the  image  plane  of  the  sensor 
due  to  changes  of  visual  directions  of  details  in  the 
environment over time (Gibson et al. 1955). The
translation  of  the  sensor  induces  a  radial  flow  in  the 
image  with  the  intersection  of  the  translational  axis  and 
image plane as its center. Sensor rotation induces a
rotational  field  in  the  image  that  is  purely  direction 
dependent  (that  is,  a  function  of  image  position  only). 


The  translational  component  (and  its  spatial  and 
temporal  derivative  fields)  contains,  e.g.,  information 
about the shape of objects (Koenderink and van Doorn
1977), about the relative depth properties of the
environment (Lee 1980, Prazdny 1980), or about motion
parameters  for  navigating  along  curved  trajectories 
(Rieger  1983).  Processing  optic  flows  induced  by 
observer/sensor  motion  can  be  done  by  decomposing  a 
flow  field  into  its  rotational  and  translational 
components  and  then  recovering  the  environmental 
information  from  the  translational  component. 
Techniques  for  doing  this  generally  require  high 
resolution  image  displacements  as  input  and  are 
sensitive  to  the  noise  and  error  that  current  techniques 
for  determining  image  motions  typically  produce.  They 
can  also  involve  solving  complex  equations  and  require 
significant  computation. 

The  recovery  of  sensor  motion  parameters  can  be 
simplified  considerably  by  making  use  of  the 
geometrical  structure  of  optic  flows  in  regions 
corresponding  to  environmental  depth  changes.  In  such 
regions  the  difference  vectors  that  have  been  computed 
over  some  neighborhood  are  oriented  approximately 
along  translational  field  lines.  This  can  be  seen  easily 
for  the  case  of  details  that  are  located  exactly  in 
the  same  direction  from  an  observer/sensor  but  are  at 
different depths (such as points along occluding
boundaries): as observed by Longuet-Higgins and
Prazdny (1980), such points will differ in their image
velocity  vectors  by  the  difference  of  their  translational 
components  only.  This  is  because  the  rotational 
components  of  optic  flows  are  purely  direction 
dependent  and  thus  equal  for  flow  vectors  positioned  at 
the  same  image  point.  The  axis  of  sensor  translation 
can  then  be  obtained  from  the  intersection  of  radial 
fieldlines  which  are  determined  by  such  difference 
vectors.  Given  the  axis  of  translation,  the  rotational  and 
translational component fields are strongly
overdetermined.

There are significant difficulties in applying this
observation to actual image sequences. Flow fields
computed from actual image sequences are not
arbitrarily dense and are in fact generally sparse so
there  will  not  be  two  distinct  flow  vectors  positioned  at 
the  same  image  point.  Thus  it  is  necessary  to  perform 
the  computation  using  difference  vectors  determined 
from  image  displacement  vectors  which  are  spatially 
separated.  From  images  formed  at  discrete,  successive 
instants  we  obtain  image  displacements  and  not 
instantaneous  optic  velocities.  Thus  the  computation 
must  be  expressed  in  terms  of  discrete  sensor  motions. 
Also,  real  flow  fields  are  noisy  and  errorful,  especially 
near  occlusion  boundaries  because  of  the  changes  in 
image  structure  that  occur  there.  Thus  the  procedure 
must  be  robust  to  such  distortions  in  the  determined 
difference  vectors.  We  have  found  that  subtracting 
spatially  separated  image  displacement  vectors  with 
different  corresponding  environmental  depths,  will  give 
reliable  approximations  to  the  correct  translational 
field  lines.  Further,  the  resulting  field  of  difference 
vectors  will  approximate  a  noisy  translational 
displacement  field  which  can  be  processed  using  general 
Hough  techniques  (Rieger  and  Lawton  1983). 


Difference  Vectors  from  Spatially  Separated 
Flow  Vectors 


Here we present results on the effects of using
spatially separated image velocity vectors to determine
difference vectors. A difference vector formed from
spatially separated image velocity vectors can be
decomposed into a signal component oriented along
the correct translational field line and a noise
component. We find that the signal component
increases for difference vectors formed at image
locations where large depth changes occur in the
corresponding environmental positions. It also increases
with increasing distance between the difference vector
and the intersection of the translational axis with the
image plane. To the extent that these conditions are
satisfied for an optic flow field, its difference vector
field will approach the corresponding set of correct
translational field lines. The computation of difference
vectors over the image does not require initially
determining the location of occlusion boundaries or of
image areas corresponding to large visual slant.


Consider a sensor O moving relative to a static
environment. As in figure 1, the image point P at the image
position $\mathbf{r} = (x, y) = (X/Z, Y/Z)$ corresponds to the
environmental point at the location $\mathbf{R} = (X, Y, Z)$. We
obtain the image velocity $\mathbf{u}$ at P by differentiating with
respect to time:

$$\mathbf{u} = \left[ (\dot{X} - x\dot{Z})\,\mathbf{e}_x + (\dot{Y} - y\dot{Z})\,\mathbf{e}_y \right] / Z .$$

Letting $\mathbf{v} = (v_x, v_y, v_z)$ and $\boldsymbol{\omega} = (\omega_x, \omega_y, \omega_z)$ denote the
translational and rotational velocities of O, the relative
motion of the environmental point becomes

$$\dot{\mathbf{R}} = -\mathbf{v} - \boldsymbol{\omega} \times \mathbf{R} .$$

Eliminating $\dot{X}$, $\dot{Y}$, and $\dot{Z}$ between the above equations
we can write the translational and rotational
components of image velocity $\mathbf{u}$ separately:

$$\mathbf{u}_T = \left[ (x v_z - v_x)\,\mathbf{e}_x + (y v_z - v_y)\,\mathbf{e}_y \right] / Z ,$$

$$\mathbf{u}_R = (-\omega_y + y\omega_z - x^2\omega_y + xy\,\omega_x)\,\mathbf{e}_x + (-x\omega_z + \omega_x - xy\,\omega_y + y^2\omega_x)\,\mathbf{e}_y .$$

Two image points $P_1$ and $P_2$ that are separated by
$\mathbf{r}_2 - \mathbf{r}_1 = (dx, dy)$ differ in their rotational flow vectors
by

$$\Delta\mathbf{u}_R = \mathbf{u}_{R2} - \mathbf{u}_{R1} = \left[ dy\,\omega_z - dx\,(2x_1 + dx)\,\omega_y + (y_1 dx + x_1 dy + dx\,dy)\,\omega_x \right] \mathbf{e}_x + \left[ -dx\,\omega_z + dy\,(2y_1 + dy)\,\omega_x - (y_1 dx + x_1 dy + dx\,dy)\,\omega_y \right] \mathbf{e}_y .$$

If $\mathbf{r}_T = (v_x/v_z, v_y/v_z)$ denotes the intersection of the
translational axis with the image plane we can rewrite
the translational flow vector as $\mathbf{u}_T = v_z(\mathbf{r} - \mathbf{r}_T)/Z$. Then
the difference vector of two translational flow vectors
at separated image positions $P_1$ and $P_2$ becomes

$$\Delta\mathbf{u}_T = \mathbf{u}_{T2} - \mathbf{u}_{T1} = \frac{v_z}{Z_1 + \Delta Z} \left[ \mathbf{r}_2 - \mathbf{r}_1 - \frac{\Delta Z}{Z_1} (\mathbf{r}_1 - \mathbf{r}_T) \right] ,$$

where $\Delta Z = Z_2 - Z_1$ is the depth separation of the
environmental details that correspond to $P_1$ and $P_2$
in the image.

Now we can rewrite $\Delta\mathbf{u}$ as consisting of a
component along a translational fieldline and a noise
component:

$$\Delta\mathbf{u} = \Delta\mathbf{u}_T + \Delta\mathbf{u}_R = \left[ -\frac{v_z\,\Delta Z}{Z_1 Z_2} (\mathbf{r}_1 - \mathbf{r}_T) \right]_{\mathrm{signal}} + \left[ \frac{v_z}{Z_2} (\mathbf{r}_2 - \mathbf{r}_1) + \Delta\mathbf{u}_R \right]_{\mathrm{noise}} .$$

For difference vectors with sufficient angular separation
from the translatory axis and separation in depth,

$$|\Delta\mathbf{u}_{\mathrm{signal}}| \gg |\Delta\mathbf{u}_{\mathrm{noise}}| .$$
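The decomposition into a signal and a noise term can be checked numerically. The following is a minimal sketch, assuming unit focal length and the flow equations of this section; the function names and the sample motion values are illustrative and do not come from the paper.

```python
import math

def translational_flow(x, y, Z, v):
    """Depth-dependent part u_T of the image velocity at image point (x, y)."""
    vx, vy, vz = v
    return ((x * vz - vx) / Z, (y * vz - vy) / Z)

def rotational_flow(x, y, w):
    """Depth-independent part u_R of the image velocity at image point (x, y)."""
    wx, wy, wz = w
    return (-wy + y * wz - x * x * wy + x * y * wx,
            -x * wz + wx - x * y * wy + y * y * wx)

def difference_terms(p1, Z1, p2, Z2, v, w):
    """Return (delta_u, signal, noise) for two image points with depths Z1, Z2."""
    vz = v[2]
    foe = (v[0] / vz, v[1] / vz)  # intersection of translational axis and image
    t1, t2 = translational_flow(*p1, Z1, v), translational_flow(*p2, Z2, v)
    r1, r2 = rotational_flow(*p1, w), rotational_flow(*p2, w)
    delta_u = (t2[0] + r2[0] - t1[0] - r1[0], t2[1] + r2[1] - t1[1] - r1[1])
    # Signal: along the field line through p1, scaled by the depth separation.
    k = -vz * (Z2 - Z1) / (Z1 * Z2)
    signal = (k * (p1[0] - foe[0]), k * (p1[1] - foe[1]))
    # Noise: caused by the image separation of the two points.
    noise = (vz / Z2 * (p2[0] - p1[0]) + (r2[0] - r1[0]),
             vz / Z2 * (p2[1] - p1[1]) + (r2[1] - r1[1]))
    return delta_u, signal, noise

def line_angle(a, b):
    """Unsigned angle in [0, pi/2] between the lines spanned by a and b."""
    cross = a[0] * b[1] - a[1] * b[0]
    dot = a[0] * b[0] + a[1] * b[1]
    return math.atan2(abs(cross), abs(dot))
```

With a large depth step between two neighbouring points the difference vector lies almost on the field line through the first point; with equal depths only the noise term remains and the orientation is uninformative.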


Recovery  of  Motion  Parameters  and  Depth 

In order to compute difference vectors from
image displacement fields formed over discrete time
intervals (as opposed to continuous instantaneous image
velocity fields), we have to be careful to describe all
quantities with respect to the same reference system.
Suppose two environmental points lie along the same
ray of projection in an image at time t. Translating
and rotating the sensor will displace the projections of
these points to new positions in the image at time
t + 1. In the image at time t + 1, the image points
will be separated due to the translational component of
the sensor motion (unless they are located on the
translational axis). The separated image points and the
intersection of the translational axis with the image
plane will be collinear at time t + 1. This is the
discrete analog of the fact that difference vectors at
discontinuities of an instantaneous optic velocity field
are oriented along translational field lines. Thus, given
image displacements D1 and D2 at positions P1 and
P2, the difference vector between points 1 and 2 is
obtained by subtracting D2 from D1 and positioning the
resulting vector at P1 + D1.


Two thresholds are used in evaluating difference
vectors. The separation threshold determines the
maximal allowable distance between displacement
vectors in determining difference vectors. The
neighborhood of a given displacement vector contains all
other displacement vectors which lie within a distance
determined by the separation threshold. The length
threshold determines the minimal allowable length for a
difference vector. For a given difference vector and a
set of radial field lines, the error angle is the angle
between the difference vector and the fieldline at that
position.
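The construction and thresholding of difference vectors can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the function name, the brute-force pairwise loop, and the choice to keep every qualifying pair (rather than averaging over a neighborhood) are assumptions.

```python
import math

def difference_vectors(points, displacements, separation_threshold, length_threshold):
    """Return (position, vector) pairs for all sufficiently long difference vectors.

    For each displacement D1 at P1, every other displacement D2 whose position
    P2 lies within the separation threshold contributes D1 - D2, positioned at
    P1 + D1. Vectors shorter than the length threshold are discarded.
    """
    result = []
    for i, (p1, d1) in enumerate(zip(points, displacements)):
        for j, (p2, d2) in enumerate(zip(points, displacements)):
            if i == j:
                continue
            # Separation threshold: only neighbours of P1 are considered.
            if math.hypot(p2[0] - p1[0], p2[1] - p1[1]) > separation_threshold:
                continue
            diff = (d1[0] - d2[0], d1[1] - d2[1])
            # Length threshold: short difference vectors are unreliable.
            if math.hypot(diff[0], diff[1]) < length_threshold:
                continue
            result.append(((p1[0] + d1[0], p1[1] + d1[1]), diff))
    return result
```

The pairwise loop is quadratic in the number of displacement vectors, which is acceptable for sparse fields such as those discussed here; a spatial grid would be the natural replacement for dense fields.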

We have found that reducing the number of
difference vectors by increasing the length threshold
and decreasing the separation threshold improves the fit
of the difference vector field to the set of correct field
lines up to a certain degree. This is because short
difference vectors (compared to the local average
magnitude) are more likely to deviate from the
correct field lines and computing difference vectors
over larger neighborhoods increases the noise
components. If, however, thresholding eliminates too
many difference vectors the fit deteriorates, since the
signal of the difference vector field becomes less
distinguished for a decreasing number of vectors.

For each image displacement vector a set of
difference vectors of sufficient length is determined
over its neighborhood. For the resulting field of
difference vectors, processing involves finding a
translational axis and the corresponding set of radial
field lines which minimizes the sum of the magnitude
of the error angles. The procedure used is basically
that used in Lawton (1982, 1983) to determine the
translational axis from noisy displacement fields induced
by rectilinear sensor motion. The error measure is
defined on a half sphere, where points on the half
sphere are possible candidates for the translational
axis. The advantage of using a sphere as a domain is
that it allows for a uniform, global sampling of the
error function. The search process itself consists of a
global sampling of the error measure to determine its
rough shape using a generalized Hough transform
(Ballard 1980, O'Rourke 1981) followed by a local
search to find a minimum.
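The two-stage search can be illustrated with a simplified sketch. Note the assumptions: candidates are sampled on a regular grid in the image plane rather than on the half sphere used in the paper, the local search is a plain coordinate descent with step halving, and all names are hypothetical.

```python
import math

def error_angle(position, vector, axis_point):
    """Angle between a difference vector and the radial field line at its position."""
    rx, ry = position[0] - axis_point[0], position[1] - axis_point[1]
    cross = rx * vector[1] - ry * vector[0]
    dot = rx * vector[0] + ry * vector[1]
    return math.atan2(abs(cross), abs(dot))

def total_error(diff_vectors, axis_point):
    """Sum of the error angle magnitudes for one candidate axis position."""
    return sum(error_angle(p, d, axis_point) for p, d in diff_vectors)

def coarse_search(diff_vectors, size, spacing):
    """Global sampling: evaluate the error on a coarse grid and keep the minimum."""
    candidates = [(x, y) for x in range(0, size, spacing)
                  for y in range(0, size, spacing)]
    return min(candidates, key=lambda c: total_error(diff_vectors, c))

def local_search(diff_vectors, start, step, min_step=0.01, max_moves=10000):
    """Local refinement: accept axis-aligned moves that lower the error."""
    best, best_error = start, total_error(diff_vectors, start)
    moves = 0
    while step >= min_step and moves < max_moves:
        improved = False
        for dx, dy in ((step, 0), (-step, 0), (0, step), (0, -step)):
            candidate = (best[0] + dx, best[1] + dy)
            error = total_error(diff_vectors, candidate)
            if error < best_error:
                best, best_error, improved = candidate, error, True
                moves += 1
        if not improved:
            step /= 2.0
    return best
```

The error angle ignores the sign of the difference vector, since a field line has an orientation but no direction. The coordinate descent is only a stand-in for the local search; it never increases the error but is not guaranteed to reach the exact minimum.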

The computation of the sensor rotation (scaled by
focal length) from the original flow field and the radial
(translational) fieldlines is straightforward. Note that the
components of the flow perpendicular to the radial
fieldlines are induced by sensor rotation. Introducing,
for convenience, a polar coordinate system $(r, \theta)$ in the
image plane centered at $\mathbf{r}_T$, the intersection of the
translational axis with the image plane, we have a system of
overconstrained linear equations of the type

$$\mathbf{u}_R \cdot \mathbf{e}_\theta = \mathbf{u} \cdot \mathbf{e}_\theta$$

in the three unknowns $\omega_x$, $\omega_y$, and $\omega_z$.

Knowing the rotational parameters yields the
translational and rotational component fields of the
original flow field. The translational component is
directly related to the relative depth of a scene (i.e.
the depth scaled by the sensor displacement in depth
$\delta Z$) by the relation $Z/\delta Z = |\mathbf{r} - \mathbf{r}_T| / |\mathbf{u}_T|$, where
$\mathbf{u}_T$ is a translational flow vector in the image. If the
frame rate is known the relative depth of an
environmental point corresponds to its temporal
separation from the sensor (under constant approach
velocity). Biological systems seem to exploit this
optical relation for a variety of navigational tasks (Lee
1980, Wagner 1982).
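One way to solve the overconstrained system and then read off relative depth is sketched below. It assumes unit focal length, the instantaneous rotational flow field implied by the relative motion equation, and an ordinary least squares solution via the normal equations; the helper names are hypothetical.

```python
import math

def rotational_flow(x, y, w):
    """Rotational image velocity at (x, y) for rotation w = (wx, wy, wz)."""
    wx, wy, wz = w
    return (x * y * wx - (1 + x * x) * wy + y * wz,
            (1 + y * y) * wx - x * y * wy - x * wz)

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= factor * m[col][k]
    s = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        acc = m[r][3] - sum(m[r][k] * s[k] for k in range(r + 1, 3))
        s[r] = acc / m[r][r]
    return s

def estimate_rotation(points, flows, foe):
    """Least squares fit of (wx, wy, wz) to the flow components that are
    perpendicular to the radial field lines through foe."""
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for (x, y), (ux, uy) in zip(points, flows):
        rx, ry = x - foe[0], y - foe[1]
        norm = math.hypot(rx, ry)
        if norm == 0.0:
            continue  # the perpendicular direction is undefined at foe itself
        ex, ey = -ry / norm, rx / norm  # unit vector e_theta
        row = [ex * x * y + ey * (1 + y * y),
               -ex * (1 + x * x) - ey * x * y,
               ex * y - ey * x]
        rhs = ex * ux + ey * uy
        for i in range(3):
            atb[i] += row[i] * rhs
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    return solve3(ata, atb)

def relative_depth(point, flow, w, foe):
    """Depth scaled by the sensor displacement in depth, from u_T = u - u_R."""
    rot = rotational_flow(point[0], point[1], w)
    tx, ty = flow[0] - rot[0], flow[1] - rot[1]
    return math.hypot(point[0] - foe[0], point[1] - foe[1]) / math.hypot(tx, ty)
```

Only the perpendicular components enter the fit, so the unknown depths drop out of the equations; the depths are recovered afterwards from the residual translational field.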


Experiments 

The flow field in figure 2a shows image
displacements at pixel positions having coordinates
which are multiples of 8 from a 128 x 128 pixel field.
The components of the displacement vectors were
stored as 8 bit integers. The environment consisted of
a spherical surface patch at depth of 10 units along the
z axis and a background spherical surface patch at a
depth of 30 units along the z axis. The obvious
discontinuities in the flow field in figure 2a indicate
the boundary of the nearer surface. The sensor motion
consisted of an initial rotation of 0.1 radians about the
(1,1,1) axis followed by a translation of 2 units along
(0,0,1). The separation threshold was set to 1 pixel and
the length threshold was set to 3 pixels. Figure 2b
shows the average difference vectors which exceeded
the length threshold. Note their occurrence along the
occlusion boundary and their strong radial character.
The resulting error function is shown in figure 2c.
(Darker in the figure corresponds to less error; also
recall that this is a plot of a hemisphere in polar
coordinates and not the image plane.) As can be seen,
it is strongly unimodal. The minimum in the global
histogram corresponded to the image position (60, 60).
The local search determined the minimum to be at (63,
63). The correct, subpixel, position was (63.5, 63.5).
The rotational component was found by optimizing a
simple expression describing the extent to which a
rotational field had vector components perpendicular
to the radial field lines (determined by the translational
axis) which were identical to those of the original flow
field in figure 2a. The resulting rotational and
translational components are shown in figures 2d and 2e
respectively. The relative depth map determined from
the translational component field is shown in figure 2f
encoded by intensity (darker means closer to the
observer).

The flow field in figure 3a was produced by a
rotation of 0.1 radians about the x axis followed by a
translation of 2 units along the z axis. The
environment consisted of a planar surface oriented
perpendicular to the z axis and 20 units away. This is
a situation in which environmental depth variance is
minimized. The difference field, computed by
averaging the difference vectors in a neighborhood
determined by a separation threshold of 1.0 and a
length threshold of 0.0, is shown in figure 3b. The
difference field in figure 3b is at a resolution 100 times
greater than the field in figure 3a because the
difference vectors are very small. This reflects that
any inference of the translational axis in this case
would be spurious.


Figure  3b 


Figure 2f


Figure  5a 


Figure  5b 


The flow field in figure 4a was produced by the
same motion as above except the environmental depths
of the points were randomly distributed between 20 and
100 units along the z axis. The corresponding
difference field is shown in figure 4b (the resolution of
this figure is 3 times greater than that in figure 4a).
The strong radial character of the difference field,
with a Focus of Expansion at the center, is apparent.

Figures 5a and 5b are 128x128 pixel images with
256 intensity levels taken from a GE TN2200 solid
state camera. The camera was displaced roughly in
the general direction of its z axis between two textured
objects towards a textured background and then rotated
about its y axis a few degrees. Figure 5c shows the
displacements determined for a set of interesting points
extracted from the image in figure 5a using the interest
operator described in Lawton (1983). The displacements
were found by correlating 5x5 pixel windows centered
at these positions in the first image with 5x5 pixel
windows positioned at locations within ±15 pixels in
the x and y directions in the succeeding image.
Displacements for points within 10 pixels of the image
boundary were ignored.


Figure  4  a 


Figure 4b

Abstract

The facet approach requires low level image
processing techniques to be based on fitting each
local image neighborhood with a function and
interpreting all processing in terms of what the
processing does to this locally fit function.
Using the facet approach we develop a different
meaning of the usual optic flow equation. We show
that it represents the intersection line of the
isocontour plane with a successive image frame.
The intersection line on the successive frame
contains the possible match points. A unique
match point can be selected by requiring it to
have the same brightness as the given pixel. We
show that this procedure amounts to assuming that
all derivatives of third or higher order are
negligible and that gray tone intensity and first
partial derivatives in row, column, and time must
match.

I Introduction

In this paper we consider the case of a camera
in uniform translational motion in a static scene.
In section II we derive the optic flow geometry
equations for uniform translational motion and
show how from the optic flow field it is possible
to compute the camera velocity parameters and the
depth, both to within an arbitrary scale factor.

Determination of the optic flow field is
usually done by matching corresponding points on
successive image frames. This kind of technique
suffers from a potentially expensive combinatorial
complexity problem. In section III we apply a
facet model technique to the problem of estimating
the optic flow field. We show how the first order
derivative optic flow equation represents the
intersection line of the isocontour plane with a
successive image frame. To select a unique match
point on this line we require that the gray tone
intensities match. We show that this procedure
amounts to requiring that gray tone intensities
match and first order partials in row, column,
and time match. The complexity of the technique
is linear in the number of pixels on the image.
There is no combinatorial matching. In section IV
we briefly mention some of the existing approaches
for matching. In section V we present results.

II Optic Flow Geometry

Consider an image created by a camera in
constant motion, the velocity of the camera being
$(a_x, a_y, a_z)$ in the x, y, and z directions
respectively. The motion of the camera causes the
position of pixels in the image to move. An image
in which each pixel contains the velocity vector
describing the motion of that pixel is called the
optic flow image. We give a brief derivation of
the optic flow.

Our perspective geometry model places the lens
at the origin looking down the y-axis. The image
plane is a distance of f in front of the lens.
Thus a point (x, y, z) in the 3D world will have an
x-position $x_p$ on the image given by

$$x_p = f\,\frac{x - a_x t}{y - a_y t} \qquad (1)$$

At t = 0, a point (x', z') on the image
corresponds to the ray $(x, y, z) = \lambda\,(x', f, z')$, where
$\lambda$, the unknown parameter, is most directly
related to the depth y of the 3D point by the
relation $\lambda = y/f$. After substituting $\lambda x'$ for
x and $\lambda f$ for y in equation (1) there results

$$x_p = f\,\frac{\lambda x' - a_x t}{\lambda f - a_y t} \qquad (2)$$

The velocity u(x', z') of point (x', z') at t = 0 can
be obtained as

$$\frac{\partial x_p}{\partial t} = f\,\frac{(\lambda f - a_y t)(-a_x) - (\lambda x' - a_x t)(-a_y)}{(\lambda f - a_y t)^2} \qquad (3)$$

Figure 5d shows the average difference vectors
which resulted from setting the separation threshold to
10 pixels and the length threshold to 3 pixels. A plot
of the error function produced using these threshold
values is shown in figure 5e. The local search found a
minimum at (52, 75). The correct position of the
intersection of the translational axis with the image
plane for the second image was determined to be at
(57.97, 74.58). Since the focal length was rather long,
the determined translational axis was well within 5
degrees of the actual one.

Figure 5d

evaluated at t = 0. Then, we have

$$u(x', z') = \left.\frac{\partial x_p}{\partial t}\right|_{t=0} = \frac{-a_x}{\lambda} + \frac{x' a_y}{\lambda f} \qquad (4)$$

The case for the z-velocity v(x', z') is similar:

$$v(x', z') = \left.\frac{\partial z_p}{\partial t}\right|_{t=0} = \frac{-a_z}{\lambda} + \frac{z' a_y}{\lambda f} \qquad (5)$$


For a camera in motion where $a_y \neq 0$, there
will be one point $(x_0, z_0)$ on the image whose
motion will be zero. To determine this point set

$$\frac{\partial x_p}{\partial t} = \frac{\partial z_p}{\partial t} = 0 \quad \text{at } t = 0 \qquad (6)$$

to obtain

$$x_0 = \frac{f a_x}{a_y}, \qquad z_0 = \frac{f a_z}{a_y} \qquad (7)$$

This point is called the focus of expansion or
contraction depending on whether the camera is
moving toward or away from the scene.
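The flow equations and the stationary point can be checked with a short sketch. It assumes the camera model of this section (lens at the origin, looking down the y-axis) and that the image z-position follows the same form as equation (1) with z in place of x; the function names are illustrative.

```python
def project(point, velocity, f, t):
    """Image position (x_p, z_p) at time t of a static 3D point (x, y, z)."""
    x, y, z = point
    ax, ay, az = velocity
    depth = y - ay * t
    return (f * (x - ax * t) / depth, f * (z - az * t) / depth)

def flow(xp, zp, lam, velocity, f):
    """Image velocity (u, v) at image point (xp, zp) with ray parameter lam = y / f."""
    ax, ay, az = velocity
    return ((-ax + ay * xp / f) / lam, (-az + ay * zp / f) / lam)

def focus_of_expansion(velocity, f):
    """Image point whose motion is zero; requires a nonzero y velocity."""
    ax, ay, az = velocity
    return (f * ax / ay, f * az / ay)
```

Comparing `flow` against a finite difference of `project` is a quick way to confirm the signs in equations (4) and (5).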


To solve for $a_x, a_y, a_z$, set the parameter
$\lambda(x', z')$ to an initial appropriate constant
depending on scale and solve in a least square
sense the system of equations:

$$\lambda(x', z')\,u(x', z') = -a_x + \frac{a_y}{f}\,x'$$

$$\lambda(x', z')\,v(x', z') = -a_z + \frac{a_y}{f}\,z' \qquad (8)$$

This yields

$$\frac{a_y}{f} = \frac{\sum \lambda\,(u x' + v z') - \frac{1}{N}\left[\sum u\lambda \sum x' + \sum v\lambda \sum z'\right]}{\sum (x'^2 + z'^2) - \frac{1}{N}\left[\left(\sum x'\right)^2 + \left(\sum z'\right)^2\right]}$$

$$a_x = \frac{a_y}{f}\,\frac{1}{N}\sum x' - \frac{1}{N}\sum u\lambda \qquad (9)$$

$$a_z = \frac{a_y}{f}\,\frac{1}{N}\sum z' - \frac{1}{N}\sum v\lambda$$

where u, v, and $\lambda$ are evaluated at (x', z'), all
summations are over all (x', z') in the image domain,
and N is the number of points in the sums. If the scale
constant for $\lambda(x', z')$ = k, an unknown constant, the
velocity components $a_x$, $a_y$, and $a_z$ will all have the
same multiplicative constant k. In this case, the
velocity magnitude is not determined, but its
direction is.
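The closed form least squares solution can be sketched directly from the normal equations of system (8). This is an illustrative sketch, assuming N denotes the number of image points in the sums; the function name and the sample layout are hypothetical.

```python
def solve_velocity(samples, f):
    """Least squares camera velocity (a_x, a_y, a_z) from flow samples.

    Each sample is (xp, zp, u, v, lam): image position, image velocity, and
    the assumed ray parameter lam at that position.
    """
    n = len(samples)
    sum_x = sum(xp for xp, zp, u, v, lam in samples)
    sum_z = sum(zp for xp, zp, u, v, lam in samples)
    sum_lu = sum(lam * u for xp, zp, u, v, lam in samples)
    sum_lv = sum(lam * v for xp, zp, u, v, lam in samples)
    numerator = (sum(lam * (u * xp + v * zp) for xp, zp, u, v, lam in samples)
                 - (sum_lu * sum_x + sum_lv * sum_z) / n)
    denominator = (sum(xp * xp + zp * zp for xp, zp, u, v, lam in samples)
                   - (sum_x * sum_x + sum_z * sum_z) / n)
    slope = numerator / denominator  # a_y / f
    a_x = slope * sum_x / n - sum_lu / n
    a_z = slope * sum_z / n - sum_lv / n
    return (a_x, slope * f, a_z)
```

If the assumed `lam` differs from the true value by a constant factor, the returned vector is scaled by that factor, so only the direction of the velocity is meaningful in that case.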


A better solution than the assumed constant $\lambda$
may be obtained by iterating for reduced residual
error by redefining $\lambda$ to be a function of the
estimated velocities.

$\lambda(x', z') =$

and then solving for $a_x$, $a_y$, and $a_z$ in terms of the
new $\lambda(x', z')$. This new $\lambda$ can be substituted
into equation (9) for a better estimate of the
velocities. Smoothness in 3D surface can be
insisted upon by taking any solution $\lambda$,
considering it as an image and performing a slope
facet iteration on it (Haralick and Watson, 1981).


Ill  Calculation  of  Optie  Flow  From  Image  Sequence 

In  this  section  we  discuss  the  calculation  of 
optic  flow  in  a  time  sequence  of  image  frames  and 
illustrate  the  facet  approach  to  the  optic  flow 
computation. 


Consider the case of a one dimensional sequence
of frames as shown in figure 1.  These frames are
obviously translates of one another with a uniform
motion.  Instead of considering a correlation
search to match each point on one frame with its
corresponding place on the next frame, consider
the sequence of frames as an image each of whose
rows corresponds to one frame.  Corresponding
points on different frames have the same
intensity.  Thus when the one-dimensional frames
are organized as an image, the corresponding
points will be on equal intensity contour lines as
shown in figure 2.  The equal intensity contour
line that any point is on is easily computed as the
line orthogonal to the gradient direction at that
point.  Thus by fitting a function to the image
intensities in a local neighborhood about a point,
as the facet model prescribes, and determining the
gradient direction from the fit, the equal
intensity contour line through the point can be
determined.  The match point on the next frame can
be obtained without any search just as the
intersection of the equal intensity contour line
passing  through  the  point  with  the  next  frame  or 


Figure 1.  Illustrates a one-dimensional
waveform which is translating in
time.

Figure 2.  Shows the equal intensity contour
lines which match corresponding
points.
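
The one-dimensional case described above can be illustrated with a minimal Python sketch. It is not the facet fit itself: central differences stand in for the derivatives of a fitted function, and the sinusoidal waveform, the names `waveform`, `frame_value` and `estimate_velocity`, and the velocity of 0.5 positions per frame are all assumptions made for illustration.

```python
import math

def waveform(x):
    """A smooth one-dimensional intensity profile (an arbitrary choice)."""
    return math.sin(0.3 * x)

def frame_value(x, t, velocity):
    """Intensity at position x on frame t when the profile shifts by `velocity` per frame."""
    return waveform(x - velocity * t)

def estimate_velocity(x, t, velocity):
    """Slope of the equal intensity contour line through (x, t) in the space-time image.

    Central differences approximate the partial derivatives that a local
    function fit would provide.
    """
    f_x = (frame_value(x + 1, t, velocity) - frame_value(x - 1, t, velocity)) / 2.0
    f_t = (frame_value(x, t + 1, velocity) - frame_value(x, t - 1, velocity)) / 2.0
    # The contour direction (dx, dt) is orthogonal to the gradient (f_x, f_t):
    # f_x * dx + f_t * dt = 0, so dx/dt = -f_t / f_x.
    return -f_t / f_x

true_velocity = 0.5
estimated = estimate_velocity(x=2.0, t=0.0, velocity=true_velocity)
# Intersection of the contour line through (2, 0) with the next frame, t = 1.
match_position = 2.0 + estimated
```

No search over candidate positions is performed; the match position follows directly from the gradient direction. The estimate is approximate here only because finite differences replace a fitted function.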

In a time-varying image sequence, the situation
is similar, only the geometry is in a one
dimension higher space, a 4-dimensional space.
To understand this geometry fix attention on one
pixel on one frame.  Use the 3D neighborhood (by
row, by column, by image frame number) around the
pixel and fit a function to the gray tone
intensities in the 3D neighborhood.  From the
function fit determine the gradient vector at the
given pixel position.  The plane which is normal
to the gradient vector is the equal intensity
contour plane passing through the given pixel.  To
determine possible match points on the next frame
intersect the equal intensity contour plane with
the next image frame.  As shown in figure 3, the
intersection is a line.  The match point can be
any place on this line.  To determine it uniquely,
find that position on the line whose gray tone
intensity is equal to the gray tone intensity of
the given pixel.


Figure 3.  Shows a sequence of five frames, the
gradient vector on frame t=0, and
the plane orthogonal to the gradient
vector and passing through the
origin.  The shaded area represents
portions of frames to the left of
the cutting plane.  The lines on
frames t=-1, t=0, and t=1 represent
the intersection of the cutting
plane with these frames.

III.1 Example

Consider a local 2D neighborhood whose gray tone
intensity function appears like a paraboloid of the
form $(r+1)^2 + (c+2)^2$.  Suppose that image frames
are taken each second and that due to the camera
motion the paraboloid translates each successive
frame by three rows and one column.  Then upon
fitting the gray tone intensities in a local 3D
neighborhood whose center pixel has coordinates
(0,0,0) in a relative coordinate frame we
determine the function

$$f(r,c,t) = (r-3t+1)^2 + (c+t+2)^2$$

Thus the paraboloid is translating by (3 rows, -1
column) on successive frames.


The partial derivatives of f are

$$\frac{\partial f}{\partial r} = 2(r-3t+1)$$

$$\frac{\partial f}{\partial c} = 2(c+t+2)$$

$$\frac{\partial f}{\partial t} = 2(r-3t+1)(-3) + 2(c+t+2)$$

Evaluating these partials at (0,0,0) yields the
gradient vector at the given pixel which is
located at the origin

$$\mathrm{grad}\, f = (2,\ 4,\ -2)$$

The plane passing through (0,0,0) and
orthogonal to this gradient vector is given by

$$2r + 4c - 2t = 0$$

Intersecting this plane with the next frame (t=1)
produces the line

$$2r + 4c - 2 = 0$$

The gray tone intensity at (0,0,0) is given by
f(0,0,0) = 5.  To find the match point, find that
(r,c) simultaneously satisfying the two equations

$$r + 2c - 1 = 0$$

$$f(r,c,1) = (r-2)^2 + (c+3)^2 = f(0,0,0) = 5$$

Substituting r = 1 - 2c into $(r-2)^2 + (c+3)^2 = 5$
yields the quadratic equation $(c+1)^2 = 0$ from
which c = -1 and r = 3, the correct translation
parameters.

III.2 Translational Motion

As the example suggests, the difficulty of the
computation might be in determining a real root of
a polynomial.  It is natural to wonder, therefore,
whether it is possible to have polynomials with no
real roots.  We demonstrate here that for the case
of translational motion, there is no possibility
of the polynomial roots being only complex.  To
see this express the local fitted function
f(r,c,t) as

$$f(r,c,t) = g(r-\alpha t,\ c-\beta t) \qquad (11)$$

explicitly indicating that the dependence between
r, c, and t is constrained to translation.

The equation of the isodensity contour plane
passing through (0,0) is given by

$$(r-\alpha t)\, g_r + (c-\beta t)\, g_c = 0 \qquad (12)$$

where $g_r = \frac{\partial g}{\partial r}(0,0)$ and
$g_c = \frac{\partial g}{\partial c}(0,0)$.
At $t = t_0$ this plane cuts the $t = t_0$ frame
producing the line

$$(r-\alpha t_0)\, g_r + (c-\beta t_0)\, g_c = 0$$

At the desired point (r,c) on this line we must
satisfy the match condition

$$g(r-\alpha t_0,\ c-\beta t_0) = g(0,0)$$

Assuming g is a function for which all partial
derivatives exist we may represent
$g(r-\alpha t_0,\ c-\beta t_0)$ by its Taylor series around (0,0)

$$g(r-\alpha t_0,\ c-\beta t_0) = g(0,0) + (r-\alpha t_0)\, g_r + (c-\beta t_0)\, g_c
+ \frac{(r-\alpha t_0)^2}{2} g_{rr} + (r-\alpha t_0)(c-\beta t_0)\, g_{rc}
+ \frac{(c-\beta t_0)^2}{2} g_{cc} + \cdots$$

Substituting g(0,0) for $g(r-\alpha t_0,\ c-\beta t_0)$ and
substituting $-(c-\beta t_0)\, g_c / g_r$ for $r-\alpha t_0$ yields

$$(c-\beta t_0)^2 \left[ \frac{g_c^2}{2 g_r^2}\, g_{rr} - \frac{g_c}{g_r}\, g_{rc}
+ \frac{g_{cc}}{2} + (c-\beta t_0)(\cdots) + \cdots \right] = 0 \qquad (13)$$

Factoring out a $(c-\beta t_0)^2$ from the left hand
side and noting that the right hand side is zero
permits us to write $(c-\beta t_0)^2 = 0$ from which we
can solve for the double real root $c = \beta t_0$.
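
The worked example of Section III.1 can be checked numerically. The Python sketch below is illustrative only. It hard-codes the example function f(r,c,t) = (r-3t+1)^2 + (c+t+2)^2, intersects the contour plane with frame t0, and solves the brightness match as a quadratic in c. The names `gradient_at_origin` and `match_point` are hypothetical, and the division by f_r assumes that f_r is nonzero.

```python
import math

def f(r, c, t):
    """Fitted intensity function of the worked example."""
    return (r - 3 * t + 1) ** 2 + (c + t + 2) ** 2

def gradient_at_origin():
    """Analytic partial derivatives of f evaluated at (0, 0, 0)."""
    f_r = 2 * (0 - 0 + 1)                             # 2(r - 3t + 1)
    f_c = 2 * (0 + 0 + 2)                             # 2(c + t + 2)
    f_t = 2 * (0 - 0 + 1) * (-3) + 2 * (0 + 0 + 2)    # chain rule in t
    return f_r, f_c, f_t

def match_point(t0=1):
    """Intersect the contour plane with frame t0, then apply the brightness match."""
    f_r, f_c, f_t = gradient_at_origin()
    # Line on frame t0: f_r*r + f_c*c + f_t*t0 = 0, rewritten as r = a + b*c.
    a = -f_t * t0 / f_r
    b = -f_c / f_r
    # Brightness match f(a + b*c, c, t0) = f(0, 0, 0) is a quadratic in c:
    # (b*c + p)^2 + (c + q)^2 - f(0,0,0) = 0 with p, q as below.
    p = a - 3 * t0 + 1
    q = t0 + 2
    A = b * b + 1
    B = 2 * (b * p + q)
    C = p * p + q * q - f(0, 0, 0)
    disc = B * B - 4 * A * C
    # Clamp tiny negative round-off; a zero discriminant means a double root.
    c = (-B + math.sqrt(max(disc, 0.0))) / (2 * A)
    r = a + b * c
    return r, c, disc

r1, c1, disc1 = match_point(1)
r2, c2, disc2 = match_point(2)
```

For this example the discriminant vanishes, so the quadratic has the double real root c = -t0 with r = 3 t0, in agreement with the translation of (3 rows, -1 column) per frame.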

III.3 Comparison

There is a relationship between this procedure
and the usual optic flow equation.  Letting $f_r$,
$f_c$, and $f_t$ designate the partial derivatives of f
with respect to r, c, and t, evaluated at the
origin, the equation of the isocontour plane is
given by

$$r f_r + c f_c + t f_t = 0 \qquad (14)$$

Intersecting this plane with the next image plane
which is taken $t_0$ seconds later produces the
line

$$-f_t = \frac{r}{t_0} f_r + \frac{c}{t_0} f_c \qquad (15)$$

Equation (15) is the usual optic flow equation
(Horn and Schunck, 1980).  The quantity $r/t_0$
represents a movement of r rows over $t_0$ seconds
and is therefore the row velocity.  Likewise $c/t_0$
represents the column velocity.

The difference in what we have done is that we
have given equation (15) an enlarged meaning.  It
is the equation of a line containing the possible
match points on the $t_0$ image frame.  But since the
match point must have the same brightness, we use
the additional constraint that the match point
(r,c) must satisfy

$$f(r,c,t_0) = f(0,0,0) \qquad (16)$$

the equal brightness constraint.  This brightness
constraint is used in the usual derivation of the
optic flow equation so it would seem to be
superfluous to use it again.  From our perspective we
see that the isodensity contour plane is really
only isodensity at the origin and as it moves away
from the origin, it must be regarded as an
approximation.  Thus the intersection line on the
successive frame is not guaranteed to have all its
points be of the same brightness as the given
pixel.  The match condition just tells us to
select that point on the line having the same
brightness as the given pixel.

III.4 Why It Works

In this section we give a detailed explanation
of why the procedure works.  We assume that all
derivatives of third or higher order are
negligible and that the match conditions consist
of matching gray tone intensity and gray tone
first partials in row, column, and time.

Let f with a subscript designate the
corresponding partial derivative of f evaluated at
r=c=t=0.  A Taylor series of f about (0,0,0)
neglecting third or higher order terms is given by

$$f(r,c,t) = f(0,0,0) + r f_r + c f_c + t f_t
+ \frac{r^2}{2} f_{rr} + rc\, f_{rc} + \frac{c^2}{2} f_{cc}
+ rt\, f_{rt} + ct\, f_{ct} + \frac{t^2}{2} f_{tt} \qquad (17)$$


A pixel (r,c) having relative neighborhood
coordinates on relative time image t matches pixel
(0,0) on time image 0 if

(1)  $f(r,c,t) = f(0,0,0)$

(2)  $\frac{\partial f}{\partial r}(r,c,t) = \frac{\partial f}{\partial r}(0,0,0) = f_r$

(3)  $\frac{\partial f}{\partial c}(r,c,t) = \frac{\partial f}{\partial c}(0,0,0) = f_c$

(4)  $\frac{\partial f}{\partial t}(r,c,t) = \frac{\partial f}{\partial t}(0,0,0) = f_t$


Condition (1) states that the gray tone
intensities must match.  Conditions (2) and (3)
state that the gray tone spatial pattern around
the original and the match pixel must match.
Condition (4) states that since the motion is
uniform with no acceleration the gray tone time
derivatives must match.

Applying these constraints to the Taylor series we
have, respectively,


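$$r f_r + c f_c + t f_t + \frac{r^2}{2} f_{rr} + rc\, f_{rc} + \frac{c^2}{2} f_{cc}
+ rt\, f_{rt} + ct\, f_{ct} + \frac{t^2}{2} f_{tt} = 0 \qquad (18)$$

$$r f_{rr} + c f_{rc} + t f_{rt} = 0 \qquad (19)$$

$$r f_{rc} + c f_{cc} + t f_{ct} = 0 \qquad (20)$$

$$r f_{rt} + c f_{ct} + t f_{tt} = 0 \qquad (21)$$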

Multiplying equation (19) by r, equation (20) by
c, equation (21) by t and adding yields

$$r^2 f_{rr} + 2rc\, f_{rc} + c^2 f_{cc} + 2rt\, f_{rt} + 2ct\, f_{ct} + t^2 f_{tt} = 0 \qquad (22)$$


Substituting this back into equation (18) yields

$$r f_r + c f_c + t f_t = 0 \qquad (23)$$

the usual optic flow equation.  Thus, the
technique of using equation (23) and the gray tone
intensity match condition (18) in essence works
because it assumes that all first partials are
matching.  However, now we see that there need not
be any problem of root finding.  We just need to
solve the overconstrained system of equations

$$\begin{aligned}
r f_r + c f_c &= -t f_t \\
r f_{rr} + c f_{rc} &= -t f_{rt} \\
r f_{rc} + c f_{cc} &= -t f_{ct} \\
r f_{rt} + c f_{ct} &= -t f_{tt}
\end{aligned} \qquad (24)$$

for the row column position (r,c) on the specified
image t.
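
As an illustration of solving the overconstrained system, the Python sketch below stacks the optic flow equation and the three first-partial match equations (19) to (21) as four linear equations in (r, c) and solves them in the least squares sense. It uses the 2x2 normal equations rather than a singular value decomposition, which is a substitution made here for brevity. The function names and the dictionary of partial derivatives are hypothetical. The numerical values are the derivatives at the origin of the example function f(r,c,t) = (r-3t+1)^2 + (c+t+2)^2.

```python
def least_squares_2(rows, rhs):
    """Least squares solution of rows * (r, c) = rhs via the 2x2 normal equations."""
    s11 = sum(a * a for a, _ in rows)
    s12 = sum(a * b for a, b in rows)
    s22 = sum(b * b for _, b in rows)
    t1 = sum(a * y for (a, _), y in zip(rows, rhs))
    t2 = sum(b * y for (_, b), y in zip(rows, rhs))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("the constraints do not determine (r, c)")
    return (t1 * s22 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det

def flow_from_partials(p, t=1.0):
    """Build the four constraints on (r, c) for image t and solve them."""
    rows = [
        (p["r"], p["c"]),     # optic flow equation: r f_r + c f_c = -t f_t
        (p["rr"], p["rc"]),   # row partial must match
        (p["rc"], p["cc"]),   # column partial must match
        (p["rt"], p["ct"]),   # time partial must match
    ]
    rhs = [-t * p["t"], -t * p["rt"], -t * p["ct"], -t * p["tt"]]
    return least_squares_2(rows, rhs)

# Partial derivatives of the example function at the origin.
example = {"r": 2, "c": 4, "t": -2,
           "rr": 2, "rc": 0, "cc": 2,
           "rt": -6, "ct": 2, "tt": 20}
r, c = flow_from_partials(example, t=1.0)
```

For the example all four equations are mutually consistent, so the least squares solution reproduces the translation of 3 rows and -1 column exactly, with no polynomial root finding.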

IV  Brief  Matching  Literature  Review 

Matching frames is an old image processing
problem.  Classically, it was solved by
translating one image against the other until the
correlation between the two images was highest.
An equivalent calculation can be done through the
use of the Fourier Transform.  Barnea and Silverman
(1972) showed how to speed up the search by
essentially not doing calculations on positions
where errors must exceed the best error so far.
These techniques work only for translation of one
image relative to the other.

In moving images, the motion is not the same
all over the image.  Correlation techniques are
not appropriate.  Martin and Aggarwal (1979) use
boundary information as the basis for matching.
Barnard and Thompson (1980) use a disparity
analysis technique for matching.  Ayala, Orton,
Larson, and Elliott (1982) use a symbolic
technique for matching.  Jacobus, Chien, and
Selander (1980) use a graph matching technique.
Aggarwal, Davis, and Martin (1981) review
techniques for establishing corresponding points
on images.  The problem with most of these
techniques is that they must employ some
combinatorial computation to establish the match.
This kind of computation is very expensive.

Techniques which do not involve combinatorial
matching include Limb and Murphy (1975) who relate
image intensity changes over time to spatial
gradient and Fennema and Thompson (1979) who use a
gradient intensity transform method.  Both these
techniques are similar to the one presented in
this paper in that they establish the match using
only local neighborhood analysis.

V Results

To confirm that our theory works, we tested our
algorithm on 3 kinds of image sequences.  The
image sequences describe the movement of an
ellipsoid in translation, magnification, and
rotation.  The time interval between two
consecutive images in a sequence corresponds to
one pixel difference in an image.

To compute an optic flow vector of a pixel on
the image at t=0, we determined the underlying
function over its 3-D neighborhood using a 3-D
cubic discrete orthogonal polynomial basis, and,
next, derived the 4 constraining equations (Eq. 24)
on the row and column components of the optic flow
vector at the center of the pixel.  To solve the
over-constrained equations, we first obtained
two singular values using the Singular Value
Decomposition routine of LINPACK, and next
determined the least squares solution from the
singular values.

To the time sequence of an ellipsoid moving
with the velocity of r/t = 1 and c/t = .8 shown in
Fig 4, we applied the above method with the 3-D
neighborhood (5x5x5) and obtained the optic flow
image shown in Fig 5.  At the pixels on or near
the boundary of the ellipsoid, the optic flow
vector obtained does not show the correct movement
of the ellipsoid because neighborhoods contain a
mixture of stationary background and moving
ellipsoid, thereby providing inconsistent
information for fitting.  The reason for the
inconsistency is that the center pixel may be in
the stationary background but it has neighbors
which are not.  These neighbors generate an
estimated surface which has some
curvature for the center pixel.

To reject an optic flow vector obtained from
such a neighborhood, we compute the ratio of two
principal curvatures from the underlying gray tone
intensity surface determined at t=0.  From the
histogram of this curvature ratio over all the
neighborhoods in the image sequence shown in Fig
6, we can determine a threshold value for the
ratio at about 0.05.  Fig 7 illustrates the
result.  The pixels which still have incorrect
directions correspond to neighborhoods with large
fitting errors over the center pixel.  Fig 8
illustrates a histogram of the center pixel
fitting error.  Thresholding the original optic
flow image with the curvature ratio of .05 and
rejecting the optic flow vectors obtained from
underlying functions having fitting error of more
than 1, we have the optic flow image shown in Fig
9.  Rejecting the vectors having fitting error of
more than .01, we have the optic flow image shown
in Fig 10.

For the time sequence of the ellipsoid moving
backwards with the magnification factor .95 shown
in Fig 11, we obtain the optic flow image shown in
Fig 12 where we thresholded the original optic
flow image with the ratio .05 and rejected the
vectors obtained from underlying functions having
fitting error of more than 1.  In the same way,
for the time sequence of the ellipsoid rotating
clockwise with the angular velocity .1 radian
shown in Fig 13, we obtain the optic flow image
shown in Fig 14.

Fig 4 Time sequence of an ellipsoid in translation

Fig 5 Original optic flow image

Fig 6 Histogram of curvature ratio

Fig 7 Thresholded optic flow image

Abstract

A method for the automatic construction of fast
special purpose vision programs is described.  The
starting point for the automatic construction process is a
description of a particular 3D object.  The result is a
fast special purpose program for recognizing and locating
that object in images, without restriction on the
orientation of the object in space.  The method has
been implemented and tested on a variety of images
with good results.  Some of the tests involved images
in which the target objects appear in a jumbled pile.
The current implementation is not fully optimized for
speed.  However, evidence is given that image analysis
times on the order of a second or less can be obtained
for typical industrial recognition tasks.  (This time
estimate excludes edge finding).


1.  Introduction 

In many practical applications of automated vision,
the vision task takes the form of recognizing and locating
a particular three dimensional object in a digitized
image.  The exact shape of the object to be perceived is
known in advance; the purpose of the act of perception
is only to determine its position and orientation relative
to the viewer.  This is model based vision in its strict
form.

Most industrial applications of vision have this
property, and also the property that the same object
(or, more precisely, objects of the same shape) must
be located in many images.  In this kind of situation,
it is desirable to split the computation into two stages:
an analysis or precomputation stage, in which useful
information about the (unchanging) object is compiled,
and an execution or runtime stage, in which this
information is exploited for the rapid recognition of the
object in an image.  The reason for breaking up the
computation in this way is of course that the analysis
only needs to be done once, whereas its results can be
exploited repeatedly.  Bolles [Bolles 1982], among others,
has taken this general approach to the model based
vision problem.

The advance analysis stage may take a variety
of forms.  In our work, this stage involves a kind of
automatic programming.  A description of the object
to be recognized is “compiled” into a special purpose
program whose only function is to recognize that one
object in digitized images.  In the second, runtime stage,
individual images are processed by the special purpose
program produced at the first stage.

This formulation of the work accomplished by the
advance analysis is very unrestrictive.  It makes no
commitment as to the algorithm which is used to process
images; rather the algorithm may be chosen according
to the object which is to be recognized.  The problem of
finding the best algorithm among all algorithms for a
given object is intractable.  However, we may attempt to
construct special purpose algorithms for object
recognition within a restricted class of algorithms, and hope
for good, though not optimal, results.

In this paper, we describe a method for automatically
constructing special purpose programs for 3D object
recognition.  To be precise, we mean by 3D object
recognition the recognition of three dimensional objects
in ordinary light intensity images (not, e.g., range
images), where no restriction is made on the orientation
of the object with respect to the camera.  Although we
speak of recognition, the process of recognition delivers
information not only about the presence or absence of
the object in the image, but also the position and
orientation of the object if it is present.  The method relies
on matching object features to image edges.  We will
not concern ourselves here with how image edges are
extracted from pixel data.  The method does not rely on
perfect results from the edge finder (if it did, it would be
of no practical interest).  The special purpose programs
generated are quite fast.  To give a very rough idea
of how fast, our method should allow the recognition
of ordinary industrial objects in moderately complex
background in a second or less on a 1 MIPS computer;
this is the time required for the matching process, and
excludes the time required for edge finding.  The data
on which this kind of general speed estimate is based
will be given later in the paper.


2.  A general strategy for special purpose
automatic programming

The general features of the model based vision
problem which make it a candidate for special purpose
automatic programming are shared by a wide variety of
computational problems.  The features in question are
that the inputs to the computation are delivered in two
stages, and that for each first stage input (here, the
object to be recognized), many second stage inputs (here,
images) must be treated.  The special purpose automatic
programming problem in its general form can be stated
as follows.

Let us suppose that a function f of two inputs x
and y must be computed repeatedly under conditions
where many values of y must be treated for each value of
x.  Then we attempt to devise, for each x, a special
purpose program $P_x$, with $P_x(y) = f(x,y)$.  More precisely,
what is wanted is an automatic process for constructing
the programs $P_x$ - that is, a synthesis method M with
$M(x) = P_x$.

The informal strategy which we use for the model
based vision problem can also be described at this level
of generality.  The strategy is just that of starting with
a general purpose program for doing the computation
in one stage, and using specialized variants of this
program as the templates from which special purpose
programs are developed.  Suppose, again, that f(x,y) must
be computed repeatedly with slowly changing x and
rapidly changing y.  Suppose also that we have in hand
a program G(x,y) which computes f(x,y) in one stage.
We begin by “unwinding” G as it applies to the value
x for which a special purpose program $P_x$ is wanted;
this unwound program $G_x(y)$ is then optimized to get
the desired result.  The unwinding is done by a kind of
symbolic execution, in which loops are unwound when
possible, and recursions are unfolded.  If the procedures
for unwinding and optimization are mechanical in
nature, then together they constitute what we have called
a synthesis method for f.  The unwinding and optimization
procedures need not be applicable to arbitrary
programs; they can be custom designed for the particular
program G at hand.
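
The unwinding strategy can be illustrated on a deliberately small problem. In the Python sketch below, polynomial evaluation stands in for f: the general program loops over the coefficients x for every input y, while the synthesis step unwinds that loop once for a fixed x and drops terms with zero coefficients. The names `general_eval` and `synthesize` and the choice of problem are assumptions made for illustration. This is not the vision program described in this paper.

```python
def general_eval(coeffs, y):
    """G(x, y): general one-stage program, evaluates sum(coeffs[i] * y**i)."""
    total = 0
    for i, a in enumerate(coeffs):
        total += a * y ** i
    return total

def synthesize(coeffs):
    """M(x): unwind the loop of G for fixed coefficients and build P_x.

    Returns the special purpose function together with its generated source.
    """
    terms = []
    for i, a in enumerate(coeffs):
        if a == 0:
            continue  # optimization: a zero coefficient contributes nothing
        if i == 0:
            terms.append(repr(a))
        elif i == 1:
            terms.append(f"{a!r} * y")
        else:
            terms.append(f"{a!r} * y ** {i}")
    body = " + ".join(terms) if terms else "0"
    source = f"def special(y):\n    return {body}\n"
    namespace = {}
    exec(source, namespace)  # compile the straight-line program
    return namespace["special"], source

x = [2, 0, 0, 5]                 # slowly changing input
P_x, source = synthesize(x)      # done once per x
values = [P_x(y) for y in (0, 1, 3)]   # applied to many y
```

The generated program contains no loop and no test on the coefficients, so the work that depends only on x is paid once, at synthesis time.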

In general, the program G does not have to be
completely specified - G may itself be a template for
an algorithm, with many details left out.  The details
may be filled in after - rather than before - G has been
unwound.

In our work on vision, we begin with a one stage
algorithm template G(x, y) which takes an object
description x, and an image y, and identifies instances of x in y.
G is a template for an algorithm in the sense of the last
paragraph; many of the details of its operation will be
specified only after it has been unwound for particular
objects.

Model-based vision is the second problem to which
we have applied this style of special purpose automatic
programming.  [Goad 82] describes the earlier
application to hidden surface elimination in 3D computer
graphics, and also contains a general discussion
of special purpose automatic programming as it
relates to other work on the automatic construction and
manipulation of programs.


3.  A one stage algorithm for model based vision

The one stage algorithm G(x, y) which we start
with is a simple sequential matching procedure.  The
kind of object description expected by G is a list
of object features, together with conditions on their
visibility.  For the current purposes, an object feature
is taken to be a curve along the object surface at which
either a surface normal or a reflectivity discontinuity
occurs.  We will restrict ourselves to straight line
segments rather than considering arbitrary curves.  So,
informally, an object feature is just a straight edge on
the object.  For each object feature, G also needs to
know the range of positions in space from which that
feature is visible (means for representing this information
will be described later).  There is no need for the list of
features making up an object description to be exhaustive;
it is sufficient that enough features be included
to make reliable recognition possible.  As a result, the
kind of description of an object which is needed for its
recognition is much less extensive than that needed for
displaying it.

The image description expected by G is of the same
general kind as the object description: it is a list of
features.  In particular, the image features employed
by G are of exactly the kind which the object features
give rise to in the imaging process: they are straight
segments in the image along which an intensity
discontinuity occurs.  The process by which this kind of
image description is generated from raw pixel data will
not be discussed in this paper.  In our experiments,
we used an edge detection program written by David
Marimont [Marimont 1982].  The program convolves the
image with a lateral inhibition operator, detects
zero-crossings in the laterally inhibited image, and then
performs linking.  Straight edges are arrived at by applying
a simple segmentation scheme (we added this last step
to Marimont’s algorithm).

Although we will restrict ourselves in this paper to
treating edge features, the same methods would apply
to any kind of object feature which gives rise in a
predictable way to an image feature.

The operation of G may be described in general
terms as follows.  G performs a simple depth-first search
for a match between object and image edges.  At any
given time in the search, G’s state includes a currently
hypothesized match M, and a current hypothesis about
the position and orientation of the object relative to the
camera.  The hypothesis concerning the location of the
object gives bounds on the location parameters, and
not exact values.  In the main loop of the algorithm,
G attempts to extend and refine its current hypothesis
by means of the following three steps (which lie at the
heart of many algorithms for perception):

(1) Predict: An object edge is selected which has
not yet been matched by any image feature.  Based on
the current hypothesis, the position and orientation of
its projection in the image is predicted.

(2) Observe: The list of image edges is checked to
see whether any has the predicted qualities.

(3) Back project: If an edge with predicted qualities
was found in step (2), then extend the match to include
this edge, and use the measured position and orientation
of the edge to refine the current hypothesis as to the
location of the camera.

The algorithm repeats this loop until either a
satisfactory match is found, or until the algorithm fails to
observe a predicted edge.  In the latter case, the
algorithm backtracks to the last choice point.  Choice
points arise when more than one image edge appears in
a predicted position and orientation.
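
The control structure of this loop can be sketched on a toy one-dimensional problem. In the Python sketch below, object edges are known offsets along a line, image edges are measured positions that include clutter, and the unknown "location" is a single offset bounded by an interval [lo, hi]. The problem, the tolerance `eps`, and the name `find_match` are assumptions made for illustration. The real algorithm works with camera positions and orientations, not a scalar offset.

```python
def find_match(object_edges, image_edges, lo, hi, eps=0.1, match=None):
    """Depth-first search for an assignment of image edges to object edges.

    The hypothesis is the partial `match` found so far (object index ->
    image index) plus bounds [lo, hi] on the unknown offset.
    """
    match = match or {}
    if len(match) == len(object_edges):
        return match, (lo, hi)                 # satisfactory match found
    # (1) Predict: take the next unmatched object edge and bound its image position.
    k = len(match)
    x = object_edges[k]
    pred_lo, pred_hi = x + lo - eps, x + hi + eps
    # (2) Observe: each unused image edge inside the predicted band is a choice point.
    for j, y in enumerate(image_edges):
        if j in match.values() or not (pred_lo <= y <= pred_hi):
            continue
        # (3) Back project: the measured position narrows the offset bounds.
        new_lo = max(lo, y - x - eps)
        new_hi = min(hi, y - x + eps)
        if new_lo > new_hi:
            continue
        result = find_match(object_edges, image_edges, new_lo, new_hi, eps,
                            {**match, k: j})
        if result is not None:
            return result
    return None                                # backtrack to the last choice point

object_edges = [0.0, 1.0, 2.5]
image_edges = [9.0, 5.0, 4.05, 6.5, 3.2]       # true offset near 4, plus clutter
found = find_match(object_edges, image_edges, lo=0.0, hi=10.0)
```

The first two candidates for the first object edge are abandoned as soon as a predicted edge fails to be observed. The third leads to a complete match and to bounds that bracket the true offset.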

This  is  a  sketch  of  the  algorithm.  Before  supplying 
further  details,  some  definitions  are  needed. 

It will be convenient to work in object centered
coordinates: The object will be thought of as fixed,
and the position and orientation of the viewer as the
unknown to be determined.  An object edge is always
regarded as an oriented segment (given by the ordered
rather than unordered pair of its end points), while an
image edge may or may not be oriented.  The imaging
process is given by the ordinary perspective
transformation.  The parameters of this transformation which
derive from the camera model are the distance from
the point of projection to the image plane, and the
field of view, given by a rectangle on the image plane.
Assume that these parameters are fixed.  Let p be a
3D position, q a 3D orientation, and x an object edge.
Then ℘([p, q], x) denotes the oriented image edge which
results from viewing x from camera position and
orientation [p, q].  The edge x may not be visible from [p, q],
either because x is occluded, lies on the “wrong side”
of the object, or because the projection of x onto the
image plane lies outside of the field of view.  In these
cases, ℘([p, q], x) is undefined.  We will write ℘(x) to
denote the projection of x when the parameters [p, q] of
the projection are clear from context.  At this point we
make several assumptions about the imaging geometry.

First we assume that, in the images which are to
be analyzed, the object sought is either not visible at
all, or lies entirely within the field of view.  Second,
we assume that the field of view is sufficiently narrow
that changes of the orientation parameter q at a fixed
position p have only negligible effects on the lengths of
projected edges, and the angles between them.  Thus,
while a change in q will in general cause some of the
image to move out of the field of view, that part of
the image which remains visible will have undergone
only a 2D rotation and translation - to within a small
tolerance.  This criterion is met in typical industrial
imaging situations.  Finally, we make the more restrictive
assumption that the distance from the camera to the
object - or more precisely from the perspective
projection point to the origin of the object centered
coordinate system - is known in advance.  (This restriction
is made to simplify the exposition, and does not apply
to the implementation described later).  Thus the
position parameter p is restricted to lie on a sphere about
the origin.  Without loss of generality, we may assume
that this is the unit sphere.
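
The perspective transformation of an oriented edge, including the cases in which the projection is undefined, can be sketched as follows. This Python sketch assumes that the edge end points are already expressed in a camera centered frame with the optical axis along +Z, so the change of coordinates implied by the camera position and orientation is omitted. It also ignores occlusion and the "wrong side" case. The names `project_point` and `project_edge` and the symmetric field of view are hypothetical.

```python
def project_point(point, focal_distance):
    """Perspective projection of a camera-frame point (X, Y, Z) onto the image plane."""
    X, Y, Z = point
    if Z <= 0:
        return None  # at or behind the point of projection
    return (focal_distance * X / Z, focal_distance * Y / Z)

def project_edge(edge, focal_distance, half_width, half_height):
    """Project an oriented 3D edge given as an ordered pair of end points.

    Returns the oriented image edge, or None where the projection is
    undefined because an end point falls outside the field of view.
    """
    ends = [project_point(p, focal_distance) for p in edge]
    for e in ends:
        if e is None or abs(e[0]) > half_width or abs(e[1]) > half_height:
            return None
    return tuple(ends)  # end point order is preserved, so orientation is kept

visible = project_edge(((1.0, 0.0, 2.0), (1.0, 1.0, 4.0)), 1.0, 1.0, 1.0)
clipped = project_edge(((5.0, 0.0, 1.0), (0.0, 0.0, 1.0)), 1.0, 1.0, 1.0)
```

Returning None mirrors the convention that the projection is simply undefined for an edge that cannot be seen in full.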

We will refer to a set of positions on the unit sphere
as a locus.  A locus is to be thought of as a set of possible
camera positions.  Let X be an object description and
Y an image description.  Recall that X is a list of
edges together with visibility conditions.  The visibility
condition for each edge e in X is given by the locus of
points from which that edge is wholly visible.  This is
called the visibility locus of e.  (Methods for representing
such loci will be given later).  Now, a match M between
object edges X and image edges Y is an assignment
of image edges to object edges.  A match also assigns
orientations to the otherwise unoriented image edges.
For any object edge e, M(e) denotes the oriented image
edge (if any) assigned to it by M.  The assignment may
be partial; that is, for some e, M(e) may be undefined.

A match M is consistent with a camera position
and orientation [p, q] if for each object edge e,
the projection Π([p, q], e) = M(e) to within errors in
measurement. A match M is consistent with a camera
position p if there is some orientation q such that M is
consistent with [p, q]. A match is consistent with a locus
L if it is consistent with every position in the locus.

As indicated earlier, the algorithm G conducts its
search for a match by attempting at each point to extend
and refine its current hypothesis about the imaging
situation. This hypothesis has two parts: the match M
found so far, and the locus L of possible positions of
the camera. In the course of the matching process, the
consistency of L with M is maintained (modulo errors
in measurement).

Now we can be more explicit about how the basic
predict-observe-back-project loop is carried out. We
may restrict ourselves to considering the case where at
least one edge has already been matched, since prediction
and back-projection do not apply to the matching
of the first edge; all acceptable candidates for matches
to the first edge must be considered, wherever they appear
in the image. Let M be the current match, and L
the current locus. G selects an object edge e which has
not as yet been matched. For the sake of brevity, when
we refer to the “position” of an edge, we will henceforth
mean its position and orientation. Bounds on the
position of the image Π(e) of the new edge e can be
predicted simply by selecting an already matched edge
e0, and computing the bounds on the possible position of
Π([p, q], e) relative to Π([p, q], e0) as p ranges over the
current locus L (recall that the value of q does not affect
relative measurements). This prediction, together with
the known position of M(e0), gives predicted bounds on
the position of the image Π(e) of e.

A similar method can be used for back projection.
Suppose that an image edge M(e) has been matched
to the object edge e. Back-projection consists of
restricting the current locus L to the smaller locus L'
which is consistent with the measured position of M(e).
Let e0 be some already matched edge, and M(e0) its
match in the image. L' is then just the set of camera
positions p in L from which the predicted position of
Π([p, q], e) relative to Π([p, q], e0) is the same as the
measured position of M(e) relative to M(e0), to within
measurement error.

This  scheme  preserves  consistency  of  matches  if 
measurement  errors  are  negligible. 

The algorithm G as described so far does not take
into account the fact that any given object edge may
or may not be visible depending on the position of the
camera. The following extension to G deals with this
aspect of the matching problem.

In  the  prediction  step,  rather  than  selecting  an 
arbitrary  unmatched  edge  as  was  done  before,  G  selects 
an  edge  e  whose  visibility  is  consistent  with  the  current 
hypothesis (formally, an edge whose locus of visibility
intersects  the  currently  hypothesized  locus).  Then,  a 
case  analysis  according  to  whether  the  edge  is  actually 
visible  is  performed. 

On  one  side  of  the  case  analysis,  G  assumes  that  e 
is  visible,  and  restricts  the  currently  hypothesized  locus 
accordingly:  the  new  restricted  locus  is  the  intersection 
of  the  current  locus  with  the  locus  of  visibility  of  the 
edge. Then G proceeds as before: it predicts the position
of e, looks for it in the predicted position, and
back-projects  if  found. 

On  the  other  side  of  the  case  analysis,  G  assumes 
that the edge is invisible, and again restricts the currently
hypothesized locus accordingly: this time the
restricted locus is the intersection of the current locus
with the complement of the locus of visibility of e. If the
restricted  locus  is  empty,  then  the  current  attempt  at 
a  match  has  failed,  and  G  backtracks  to  the  last  choice 
point.  If  the  restricted  locus  is  non-empty,  G  proceeds 
by  selecting  another  object  edge  to  match. 

This case analysis step constitutes a choice point
for backtracking. Thus, for each edge selected for
matching, G assumes first that the edge is visible and
looks for it. If this course of action leads to a good
match, then all is well. Otherwise G backtracks and
looks for a match under the assumption that the edge
is invisible.
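
The visibility case analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation described in this paper: a locus is taken to be a frozenset of patch indices, edges are tried in a fixed order rather than chosen by G, and the names `search`, `Locus`, `Match` and the callback `find` (standing in for predict-observe-back-project) are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Optional

Locus = FrozenSet[int]      # hypothetical: a locus as a set of patch indices
Match = Dict[str, str]      # object edge name -> image edge name


def search(edges: List[str],
           visibility: Dict[str, Locus],
           locus: Locus,
           match: Match,
           find: Callable[[str, Locus, Match], Optional[Locus]]) -> Optional[Match]:
    """Depth-first matching with one visibility choice point per edge.

    `find` stands in for predict-observe-back-project. It should record the
    image edge in the match it is given and return the back-projected locus,
    or return None when no image edge satisfies the prediction.
    """
    if not edges:
        return match                      # every edge has been accounted for
    e, rest = edges[0], edges[1:]

    # First branch: assume e is visible and look for it.
    visible_locus = locus & visibility[e]
    if visible_locus:
        trial = dict(match)               # copy, so backtracking is trivial
        refined = find(e, visible_locus, trial)
        if refined:
            result = search(rest, visibility, refined, trial, find)
            if result is not None:
                return result

    # Second branch: assume e is invisible.
    invisible_locus = locus - visibility[e]
    if invisible_locus:
        return search(rest, visibility, invisible_locus, match, find)
    return None                           # empty locus: backtrack
```

Copying the match before the first branch keeps the second branch free of any assignment made while e was assumed visible, which is all that backtracking requires in this sketch.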

In  the  above,  an  edge  should  be  considered  visible 
only  if  -  in  addition  to  meeting  the  usual  criteria  -  its 
projection  is  long  enough  to  allow  detection  by  the  edge 
finding  program.  An  edge  which  presents  itself  end-on 
to  a  given  camera  position  should  not  be  considered 
visible  from  that  camera  position. 

Among other details about G which have been suppressed
so far is the method by which loci of camera
positions are represented. A very simple representation
is adequate for our purposes. Suppose that we have
a scheme for partitioning the unit sphere into an arbitrary
number of patches such that the diameters of the
patches go to zero as their number increases. Then a
locus can be represented to any desired resolution by a
set of patches from a partition of adequate size. More
precisely, a locus L is to be represented by the set of
patches from the partition which contain some point
of L. The resolution of this representation is bounded
by the maximum diameter of a patch. Thus loci are
represented by subsets of a finite set. These in turn
may be represented by bit maps: one bit is allocated to
each patch on the sphere. Bit maps are a particularly
good representation for the current application, since
the operation most frequently performed on loci is intersection,
and intersection of bit maps is very fast on
any computer. The particular scheme which we have
chosen for partitioning the sphere is not the best but
the simplest. The partition is generated by first imposing
a regular grid on the faces of a cube. The cube is
then projected radially onto the sphere. The patches on
the sphere which we end up with are simply the projections
of grid elements from the faces of the cube. In
recent experiments, we have used 6 by 6 grids on the
faces of the cube, yielding a total of 216 patches. This
representation is depicted in figure 1. One 36 bit PDP-10
machine word is allocated to each face. So, a locus
is represented by 6 machine words.
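
The bit map representation can be illustrated with a short Python sketch. It assumes one particular numbering of faces and grid cells (the text does not specify one), and the function names are hypothetical; Python integers stand in for the 36 bit machine words.

```python
GRID = 6                                  # 6 by 6 grid on each cube face
FACES = 6
FULL_WORD = (1 << (GRID * GRID)) - 1      # all 36 bits of one face set


def patch_of(x, y, z):
    """Return (face, bit) of the patch containing direction (x, y, z).

    The face numbering (2 * axis, plus 1 for the negative side) and the
    row-major bit order are assumptions made for this sketch.
    """
    comps = (x, y, z)
    axis = max(range(3), key=lambda i: abs(comps[i]))   # dominant axis
    major = comps[axis]
    if major == 0:
        raise ValueError("the zero vector has no direction")
    face = 2 * axis + (0 if major > 0 else 1)
    # Radial projection onto the cube face: the other two coordinates,
    # divided by the dominant one, fall in [-1, 1].
    u, v = (comps[i] / abs(major) for i in range(3) if i != axis)
    col = min(int((u + 1) / 2 * GRID), GRID - 1)
    row = min(int((v + 1) / 2 * GRID), GRID - 1)
    return face, row * GRID + col


def empty_locus():
    return [0] * FACES


def full_locus():
    return [FULL_WORD] * FACES


def add(locus, x, y, z):
    face, bit = patch_of(x, y, z)
    locus[face] |= 1 << bit


def intersect(a, b):
    return [wa & wb for wa, wb in zip(a, b)]      # one AND per face word


def complement(a):
    return [FULL_WORD & ~w for w in a]


def is_empty(a):
    return not any(a)


def count(a):
    return sum(bin(w).count("1") for w in a)
```

Intersection and complement, the operations needed by the visibility case analysis, each cost six word operations in this layout.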


figure  1 


We are still not done with the development of G.
A major shortcoming of G as described so far is that it
relies on perfect performance by the edge finder - each
edge on the object which is in view must be detected
by the edge finder if the method is to function properly. This
kind of perfect performance is not obtained by existing
edge detection programs, nor can it be obtained by
any edge detector which relies on local image intensity
discontinuities to detect edges, since object edges do not
always give rise to such intensity discontinuities.

The matching algorithm may be modified in order
to take into account the imperfections of the edge finder
by accepting matches in which only a fraction of the
expected edges are present. If such a modification is
made, the criteria of match success and match failure
become more complicated. It is necessary to determine
conditions under which a partial match should
be dropped because of inadequate success in finding
predicted edges, and also conditions under which a
match should be accepted as reliable evidence that the
object being looked for has actually been found.

In order to do this, we would like to be able to
assess the probability that a given partial match arose
from the object in the manner claimed. If this probability
is very low, then the match should be dropped, and
if very high, it should be accepted. For intermediate
values, more edges should be matched, if possible, in
order to accumulate further evidence.

Direct estimates of this probability are difficult to
make in the usual cases. Nonetheless, we can proceed
by estimating conditional probabilities, and use qualitative
considerations to get from these conditional probabilities
to the needed conclusions concerning a given
match.

On the one hand, it is possible to estimate the
probability that the configuration of edges making up
a match arose by chance given any particular assumed
“background” distribution of edges giving rise to chance
matches, since the specificity of each prediction involved
in the match is known, as is the number of such predictions
which have been met. We will refer to the inverse
of this probability as the “reliability” of the match.
The background distribution of edges usually cannot
be derived from first principles, but is best determined
by gathering statistics on sample images of the kind on
which the algorithm is to be used.

On the other hand, suppose that we have a partial
match in which some fraction of the predicted edges are
missing. Then we can estimate the probability that the
given set of edges would be missed by the edge detector
under the assumption that the partial match did
arise as claimed. We will refer to this probability as
the “plausibility” of the match. Estimating plausibility
requires information about the performance of the edge
detector. As in the case of edge distributions, deriving
this kind of information from first principles is difficult;
again, compiling statistics from sample images is a better
idea. Note that underestimating the performance
of the edge detector will lead to robust performance by
the matching algorithm.

So the reliability of a match measures how unlikely
it is to have arisen assuming it is in fact incorrect,
and its plausibility measures how likely it is to
have arisen assuming it is in fact correct. Assuming
that the presence of the object in the field of view is
moderately likely and there are unlikely to be impostors
of the object in view, it follows from Bayes’ Rule that
high reliability provides strong evidence that the match
is correct, while very low plausibility provides strong
evidence that the match is incorrect. (Notes: (a) Low
reliability does not provide evidence that the match is
incorrect, nor does high plausibility provide evidence
that the match is correct. (b) By an “impostor” in the
above, we mean an object which is regarded as distinct
from the target object, but looks nearly the same.)

The following modifications to the algorithm G are
needed to deal with imperfect edge finding. First of all,
G must maintain estimates of the reliability R and the
plausibility P of the current match in the course of its
search. When R exceeds a predetermined threshold,
the match should be accepted, and when P falls below
another predetermined threshold, backtracking should
occur. Second, G needs to perform a case analysis
according to whether each expected edge is detected by
the edge finder, in addition to the case analysis which
it already performs concerning whether the edge is in
view. This case analysis will also constitute a choice
point for the purpose of backtracking. Thus, G will
proceed as follows. For each new edge e which it selects
for matching, it will (1) assume that e is in view, (2)
assume that e is detected, (3) look for e, (4) continue
the match. If the match fails, then it will backtrack
to (2) and assume that e, though in view, was not
detected, and will proceed to the matching of other
edges. Finally, if this last match fails, it will backtrack
to (1) in the manner described earlier.
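
The two thresholds can be illustrated with a small Python sketch. It rests on assumptions that go beyond the text: chance matches of separate predictions and missed detections of separate edges are treated as independent, so the probabilities multiply, and the plausibility test is applied before the reliability test (the text does not fix an order). The function names and threshold parameters are hypothetical.

```python
import math


def reliability(chance_probs):
    """Inverse of the probability that every met prediction was met by chance.

    chance_probs: for each met prediction, the assumed probability that a
    background edge satisfies it (independence is an assumption here).
    """
    return 1.0 / math.prod(chance_probs)


def plausibility(miss_probs):
    """Probability that the edge finder missed exactly the unfound edges.

    miss_probs: for each in-view but undetected edge, the assumed probability
    that the edge finder fails on it (again assumed independent).
    """
    return math.prod(miss_probs)


def decide(chance_probs, miss_probs, r_accept, p_reject):
    """Return 'backtrack', 'accept' or 'continue' for the current match."""
    if plausibility(miss_probs) < p_reject:
        return "backtrack"
    if reliability(chance_probs) > r_accept:
        return "accept"
    return "continue"
```

With thresholds of 1000 for R and 0.01 for P, three met predictions with chance probabilities 0.1, 0.1 and 0.05 give R of about 2000 and the match is accepted, while three missed edges at a miss probability of 0.2 each give P of about 0.008 and force backtracking.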

The algorithm as it now stands is no more than
an elaboration on the simplest of matching algorithms:
sequential matching with backtracking. Nonetheless,
if we judge the efficiency of a matching method by
the number of matching steps which it goes through
in searching for a correct match, the algorithm does
not come out badly. The principal reason for this is
that only a few edges on an object of known shape
need to be identified in order to determine the position
and orientation of the object. In fact, identification
of the image projections of three non-collinear points
on the object is sufficient to narrow the set of possible
positions and orientations of the object to at most two
distinct possibilities. (This last statement holds exactly
for orthographic projection, and applies to perspective
projection as well unless the camera is very close to
the object, or the precision of measurement is very
high, in which case one of the two possibilities may
be eliminated.) A fourth non-coplanar point suffices
to remove the remaining ambiguity. The identification
of three pairwise non-parallel lines (without specified
end points) will accomplish the same task. Thus, a
match does not need to proceed very far before the
locus of possible positions of the camera will have been
narrowed to only a few grid points by back projection.
Thereafter, the matching of additional edges serves to
check the correctness of the match, but not to further
refine the estimate of camera position. The positions of
these additional edges will be predicted accurately, and
as a consequence the probability that many such edges
will be found in the case of an incorrect match will be
exceedingly low. Thus, bad matches are likely to fail
very early. Conversely, the reliability of a good match
will rise quickly as more expected edges are found, so
that the cost of achieving a very reliable match is low.
So, the effectiveness of the current algorithm relies on
the fact that it requires exact quantitative matching of
object edges to image edges at all stages during a match.


4.  Specialization  and  instantiation  of  the  one 
stage  algorithm 


The result of unwinding the main loop of the
schematic algorithm G described in the last section may
be diagramed informally in the following way.


Here, “find(e)” represents the predict-observe-back-project
operation by which new edges are
matched, “e ← select-edge()” represents the selection of
an unmatched edge to look for, “assumevis(e)” represents
the case analysis according to visibility of edges
and “assumefnd(e)” represents the case analysis according
to whether the edge has been detected. Our
job now is to fill in the details in this unwound
schematic algorithm, making use as appropriate of
the fact that the object description is available in advance.
In the following discussion, we will employ
the “compile-time/run-time” terminology familiar from
compiler design. Operations which are carried out in
the course of constructing specialized variants of G
will be referred to as “compile-time” operations, while
operations carried out by those specialized variants in
the analysis of images will be referred to as “run-time”
operations.

The compile-time process by which the above
schematic search tree is filled in may be thought of as
moving from the root of the tree down, fully instantiating
nodes as it goes. Imagine for the moment that the
first k levels of the tree have been filled in, and that the
task at hand is to fill in a particular node at the next level.
As will be seen in a moment, selection of the object
edge to be matched at each point in the tree is done
at compile-time. That is, the select-edge() operation is
executed at compile-time, so that each node in the
instantiated search tree will refer to a particular object
edge to be matched. The tree developed to level k will
look something like this:


assumevis(e1)
  no:  assumevis(e2)  ...
  yes: assumefnd(e1)
         no:  assumevis(e2)  ...
         yes: find(e1)  ...
              :
[ assumevis(ej)
    no:  ...
    yes: assumefnd(ej)
           no:  ...
           yes: find(ej) ]


figure 3 (the box is drawn here as square brackets)


The box indicates nodes to be filled in at the current
stage. We will deal with all the nodes involved in
matching a particular object edge at once, rather than
following a strict level by level order. In what follows
we will specify how each of the operations in the boxed
nodes is instantiated.

(1) The find operation. This involves prediction of
the position of ej, a check to see if any image edges lie in
the predicted position, and back projection. Recall that
prediction is carried out by computing bounds on the
location of ej relative to an already matched edge e0,
and then using the known position of e0 to get bounds
on the position of ej in the image. The position of
one edge ej relative to another e0 may be specified in
various ways. The only requirement here is that the
relative position be given in a way that is invariant under
translations and rotations. In any case, a vector
of four numbers [a1, a2, a3, a4] suffices. For example,
a1 and a2 might give the coordinates of the center of ej
relative to the image coordinate system with origin at
the center of e0 and x-axis directed along e0, a3 the
length of ej, and a4 the orientation of ej relative to e0.
Let relpos(ej, e0, p) denote the position of ej relative to
e0 from camera position p in whatever representation
is chosen. More generally, let relpos(ej, e0, K) denote
bounds on the components of relpos(ej, e0, p) as p ranges
over locus K. Let L be the currently hypothesized locus
at the time that the find operation is executed. We
want to compute relpos(ej, e0, L). The following very
simple scheme is adequate. Namely, at compile time,
relpos(ej, e0, g) is computed for each grid element g,
and stored in a table. Then, relpos(ej, e0, L) is computed
at run time simply by taking the union of the
bounds relpos(ej, e0, g) for g in L. These unions are taken
componentwise, so that the end result of the process
is a set of numerical upper and lower bounds on each
of the components of the relative position. Note that
this manner of computing bounds loses some information
since constraints relating different components of
the relative position are not expressed by simple bounds
on the components. As a result, edges with positions
predicted in this way may not be consistent with any
camera position in the locus. But this is comparatively
unlikely, and, in any case, bad edges of this kind will
be thrown out in the back projection stage.
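
The run-time prediction step amounts to a componentwise union of precomputed intervals. The Python sketch below shows this under stated assumptions: the table is a dictionary from grid element index to a list of (low, high) pairs, one pair per component of the relative position, and the names `predict` and `satisfies` are hypothetical. The numbers in the usage note are invented for illustration.

```python
def predict(table, locus):
    """Componentwise union of the per-grid-element bounds over `locus`.

    table[g] is a list of (low, high) pairs, one per component of the
    relative position, assumed to have been computed at compile time.
    """
    boxes = [table[g] for g in locus]
    if not boxes:
        raise ValueError("cannot predict over an empty locus")
    n = len(boxes[0])
    return [(min(b[i][0] for b in boxes), max(b[i][1] for b in boxes))
            for i in range(n)]


def satisfies(measured, bounds):
    """True if every measured component lies within its (low, high) bound."""
    return all(lo <= m <= hi for m, (lo, hi) in zip(measured, bounds))
```

For example, with two grid elements whose bounds on two components are [(0.0, 1.0), (2.0, 3.0)] and [(0.5, 2.0), (1.0, 2.5)], the union is [(0.0, 2.0), (1.0, 3.0)]. A measurement (0.1, 1.2) satisfies the union but neither grid element individually, which is the loss of information noted above.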

The above method is very crude, but it can be
made quite fast. For example, less than 50 machine instructions
per grid element are required to carry out the
prediction operation in an aggressively coded implementation
on the PDP-10, VAX, or Motorola 68000. The
number of grid elements which need to be considered in
a given prediction step of course depends on the details
of the match in progress. In most matches, the size of
position loci to be considered decreases rapidly as the
match proceeds. For the experiments described later,
the average prediction step involved less than 10 grid
elements.

Efficient implementation of the observation stage
is  a  standard  exercise  in  computational  geometry.  The 
problem  is  to  design  a  data  structure  for  storing  image 
edges  such  that  the  set  of  edges  satisfying  a  given 
prediction  can  be  retrieved  rapidly.  Any  of  a  variety 
of  methods  involving  binary  search  on  the  parameters 
of  the  prediction  will  do. 
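
One possible data structure of this kind is sketched below in Python. It is an assumption-laden illustration rather than the structure used in this work: image edges are indexed by length only, binary search narrows the candidates to a length interval, and any remaining bounds are checked by a caller-supplied predicate. The class name `EdgeIndex` is hypothetical.

```python
import bisect


class EdgeIndex:
    """Image edges sorted by length, for retrieval by a length interval."""

    def __init__(self, edges):
        # edges: iterable of (length, payload) pairs
        self._edges = sorted(edges, key=lambda e: e[0])
        self._lengths = [e[0] for e in self._edges]

    def query(self, lo, hi, accept=lambda edge: True):
        """Edges with lo <= length <= hi that also pass `accept`."""
        i = bisect.bisect_left(self._lengths, lo)
        j = bisect.bisect_right(self._lengths, hi)
        return [e for e in self._edges[i:j] if accept(e)]
```

Each query costs two binary searches plus a scan of the edges whose length falls in the predicted interval.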

The tables of relative positions constructed at compile
time for the prediction step can be used for back-projection
as well. In order to determine which grid
elements of the current locus are consistent with a given
set of measurements, it is only necessary to compare the
measured values to the bounds relpos(ej, e0, g) which
have been pre-computed for each grid element g.
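
Back-projection against the same kind of table can be sketched as a filter over the current locus. As before this is a minimal Python illustration under assumptions: the table maps a grid element index to a list of (low, high) bounds, the locus is a set of indices, and the name `back_project` and the tolerance parameter `tol` are hypothetical.

```python
def back_project(table, locus, measured, tol=0.0):
    """Grid elements of `locus` whose stored bounds admit the measurement.

    table[g] is a list of (low, high) pairs, one per component of the
    relative position; `tol` widens each bound to allow for measurement
    error.
    """
    return {
        g for g in locus
        if all(lo - tol <= m <= hi + tol
               for m, (lo, hi) in zip(measured, table[g]))
    }
```

An empty result means that no camera position in the locus is consistent with the measured edge, so the candidate image edge is rejected.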

(2)  Selection  of  the  next  edge  to  look  for. 

In the course of a match, the currently
hypothesized locus of camera positions is refined in two
ways: by back-projection, and by making assumptions
about the visibility or invisibility of particular edges.
The data necessary for the latter kind of refinement is
available at compile time. As a result, each point in the
instantiated search tree has an associated compile-time
locus of possible camera positions - namely, the set of
camera positions which are consistent with the visibility
assumptions made on the path leading from the root to
the current node.

Our task is to select at compile-time an appropriate
object edge to look for next at the current stage of
the match. There are three considerations which are
relevant to this selection. First, the likelihood that the
selected edge is visible should be as high as possible. We
don’t wish to select an edge which will be visible from
only a small fraction of the current locus or which, even
if visible, the edge finder is unlikely to detect, since the
computation time spent looking for it will then be unlikely
to pay off. Second, the prediction of the position
of the edge should be as specific as possible, since this
will lead to a lower likelihood of false matches for the
edge. Third, it is desirable that measurements on the
image position of the observed edge should provide as
much information as possible about the camera position.
Each of these factors can be evaluated in a quantitative
manner at compile time. Assuming a uniform
probability distribution on the position of the camera
(or more generally, assuming any particular prior distribution
on camera positions) and statistics about the
performance of the edge detector, the probability of the
visibility of any edge over any locus can be computed.
Similarly, the specificity of a prediction is naturally
measured by the inverse of the probability that a randomly
chosen edge will meet the prediction. This in
turn can be computed in a straightforward way from
the bounds involved in the prediction. The numerical
values of the bounds for the prediction can be computed
for each possible camera position in the compile-time
locus at compile time. By averaging these bounds (using
the weighting of the prior distribution on camera positions),
an expected specificity for a prediction of a given
edge can be determined at compile time. Finally, in a
similar manner, the information obtained from measuring
the position of any given edge can be evaluated for
each camera position at compile time.

By this method, the best edge to match next can
be determined at compile time, assuming that a way
of combining the factors listed above has been chosen.
The question of exactly what weight should be given to
each factor is a complex one. In qualitative terms, the
order in which the considerations have been stated here
reflects their order of importance for the efficiency of
the matching process. Any weighting function which
respects this order of importance is likely to be acceptable.
We regard the detailed analysis of this question
as an open research topic.
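
One way of combining the three factors is sketched below in Python. Everything specific in it is an assumption made for illustration: loci are sets of grid element indices, the uniform prior is approximated by giving every grid element equal weight, the three factors are assumed to be normalised to the range 0 to 1, and the decreasing weights (4, 2, 1) are arbitrary. The function names are hypothetical.

```python
def p_found(locus, visibility_locus, p_detect):
    """Probability that an edge is both in view and detected.

    Assumes equal weight for every grid element of `locus` and a single
    detection probability `p_detect` for the edge.
    """
    return p_detect * len(locus & visibility_locus) / len(locus)


def select_edge(candidates, weights=(4.0, 2.0, 1.0)):
    """Pick the candidate edge with the highest weighted score.

    candidates maps an edge name to a triple
    (probability of being found, specificity score, information score),
    each assumed normalised to [0, 1]. The weights decrease in the order
    in which the considerations are stated in the text.
    """
    def score(edge):
        return sum(w * s for w, s in zip(weights, candidates[edge]))
    return max(candidates, key=score)
```

A linear combination is only one option; any scoring rule that ranks the probability of finding the edge above specificity, and specificity above information, fits the qualitative ordering given above.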

Note that the determination of the best edge to
match next at any stage is computationally expensive,
since it involves calculations concerning each edge at
each camera position. But, as remarked above, large
amounts of computation time at compile time are often
justified for smaller gains at run time.

(3) The visibility case analysis. The method by
which visibility case analyses are performed has already
been fully specified, so no compile time instantiation
of the node is needed. However, certain visibility case
analyses can be dispensed with entirely based on information
available at compile time. It will often happen
that the particular edge ej whose visibility is in question
is in fact visible throughout the compile-time locus, so
that no case analysis according to its visibility need be
performed. That is, it will often happen that along the
path taken from the root of the search tree to the current
node, visibility assumptions have been made which
together guarantee the visibility of the current edge. In
this case, the case analysis node is simply left out.

Note that this optimization is a very important one
in that it greatly reduces the size of the search tree
(though usually has only a minor effect on its depth). If
n edges are involved in a match, then in principle, there
are 2^n distinct combinations of visibility and invisibility
for the edges to be considered. Of course, only a small
fraction of these cases can actually occur, since
the visibilities of edges are not determined independently.
For example, there are 26 distinct visibility/invisibility
combinations for the edges of a cube (one for each face,
edge, and vertex which the viewer might be “facing”),
rather than 2^12 = 4096.
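
The count of 26 for the cube can be checked by enumeration. The Python sketch below makes two simplifying assumptions that are not spelled out in the text: the viewer is far enough away that a face is visible exactly when the view direction has a positive component along its outward normal, and an edge counts as visible when at least one of its two adjacent faces is visible. Under these assumptions only the signs of the three components of the view direction matter, giving 26 nonzero sign patterns.

```python
from itertools import product

# A cube face is (axis, sign); an edge is the pair of faces that meet in it.
FACES = [(axis, sign) for axis in range(3) for sign in (+1, -1)]
EDGES = [(f, g) for i, f in enumerate(FACES) for g in FACES[i + 1:]
         if f[0] != g[0]]                 # faces on different axes are adjacent


def visible_edges(direction):
    """Edges bordering at least one face turned toward `direction`."""
    def face_visible(face):
        axis, sign = face
        return sign * direction[axis] > 0
    return frozenset(e for e in EDGES
                     if face_visible(e[0]) or face_visible(e[1]))


def aspects():
    """Distinct sets of visible edges over all sign patterns of the view."""
    directions = [d for d in product((-1, 0, 1), repeat=3) if any(d)]
    return {visible_edges(d) for d in directions}
```

The enumeration yields 26 distinct edge sets, containing 4, 7 or 9 visible edges according to whether the viewer faces a face, an edge or a vertex.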

(4) The detection case analysis. We treat detection
case analyses in a similar manner to visibility case
analyses: we drop a detection case analysis if the assumption
that the current edge is not detected causes
the plausibility of the current match to drop below
threshold. The information about the current match
which is needed to compute its plausibility - namely, the
information as to which edges have been detected and
matched and which have not - is available at compile-time.
It can be read off by following the path from the
root of the search tree to the current node.

We can sum up the speed gains achieved by
specialization in a very rough way by noting that the
time required for an average matching step in the
specialized program will be on the order of a few
milliseconds on a 1 MIP machine. As a consequence,
several hundred matching steps can be executed per
second in the search for a match. This speed is much
greater than that obtained by existing methods, and is
adequate for a variety of practical applications.


5.  Experiments 

The scheme for generating special purpose vision
programs described above has been implemented
in MacLisp running on the PDP-10 at the Stanford
Artificial Intelligence Laboratory. A few refinements
not described earlier are included in the implementation.
For example, in this account, we have not considered
the effect of measurement error, nor have we
described any method for matching partially visible (or
partially detected) segments. The implementation includes
machinery for dealing with both of these matters.
Also, until now we have required that the distance to
the object be known exactly in advance. This requirement
is weakened in the implementation; it is generally
sufficient if the distance is known to within a factor of
2.

On the other hand, the implementation does not
yet fully automate the selection of the order in which
edges are treated; in the experiments we chose the order
by hand. Nor does it come close to realizing the
potential for speed of the underlying algorithm. For example,
efficient data structures and accessing methods
for the set of image edges have not been implemented.
The speed figures given earlier are estimates of what
could be obtained in an aggressive implementation, not
measurements of current performance.

So far, tests involving three different objects have
been run. The objects treated were a connecting rod
casting, a universal joint casting, and a key-cap (key-caps
are the plastic keys which make up typewriter and
terminal keyboards). In each test, a special purpose
program was generated automatically from a description of
the object; this program was then applied to images of
the object digitized from a television camera. In the
case of the connecting rod and universal joint, the
pictures contained only one instance of the object against a
relatively uncluttered background. These images were
successfully analyzed with relatively little effort by the
vision programs; correct matches were obtained in each
case after fewer than 50 matching steps.

The special purpose program for recognizing key-caps
was subjected to a more arduous test. We digitized
an image of a jumbled pile of key-caps (see figure 4).
The edges found in this image by David Marimont's
edge finder [Marimont 1982] are displayed in figure 5.

The task of the key-cap recognition program was to
find instances of key-caps which were - roughly speaking -
within 45 degrees of right-side-up. More precisely,
the locus of allowable orientations of the camera
relative to the key-cap was the locus making up the top
face of the cube in the scheme for representing loci
described earlier. Key-caps of a variety of shapes
appear in the image; only key-caps with square upper faces
were sought by the program. This is a severe test for
the matching method for several reasons: (1) Objects of
the desired kind must be recognized in a complex
background - a background in which many objects similar
to the target object appear. (2) The target object has
only a limited number of features on which the match
can be based. (3) Resolution is quite low. Although the
entire image has a resolution of 240 by 240 pixels, each
object to be recognized occupies only a 40 by 40 region.
Also, lighting and contrast were not particularly good.

The program was run in a mode in which not just
one, but every match meeting the reliability criteria was
returned. Further, the reliability threshold was set at a
very low level, so that every plausible match was found.
The matches found were then ranked by reliability. The
top three matches in this ranking were in fact correct.
They are displayed in figure 6. Most of the remaining
matches - which had been assigned lower reliability -
were incorrect. The total number of matching steps
required in this experiment was 960; so, in the hypothetical
"aggressive implementation" mentioned earlier, the
whole process would take a couple of seconds.
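
As a rough check on these figures, the arithmetic can be sketched in
Python. The per-step cost of 2 ms is an assumed value picked from the
"few milliseconds" range quoted earlier; it is not a measurement, and
the function names are ours.

```python
def steps_per_second(ms_per_step: float) -> float:
    """Matching steps executed per second at a given per-step cost."""
    return 1000.0 / ms_per_step


def search_time_seconds(num_steps: int, ms_per_step: float) -> float:
    """Total search time for a given number of matching steps."""
    return num_steps * ms_per_step / 1000.0


# Assumed per-step cost (hypothetical; the text says only "a few milliseconds").
ASSUMED_MS_PER_STEP = 2.0

rate = steps_per_second(ASSUMED_MS_PER_STEP)
keycap_time = search_time_seconds(960, ASSUMED_MS_PER_STEP)
print(f"{rate:.0f} steps/s, key-cap search: {keycap_time:.2f} s")
```

Under this assumption the 960-step key-cap search comes to just under
two seconds, consistent with the "couple of seconds" estimate.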

This experiment indicates that the matching algorithm
can find matches under difficult circumstances.
In this case, due to the small number of matching
features, it is not possible to achieve very high reliability
of matches. Typical industrial objects, such as the
castings mentioned above, have many more features on
which a match can be based, and hence allow very good
reliability of detected matches.


figure  5 


figure  6 


6.  Extensions 

We have described in some detail the construction
of specialized variants of a comparatively simple
matching algorithm. It should be evident that the same
general scheme can be applied to more complex matching
algorithms which handle a wider class of problems,
or which exploit additional structure in the matching
situation in order to enhance performance.

An elaboration of the current algorithm which
is particularly useful and to which our scheme for
specialization extends easily is as follows. If the target
object has symmetries, or if more generally there
are recurring patterns of edges on the object, then the
matching process may proceed by first seeking an
instance of the recurring pattern, and then performing a
case analysis according to which of several instances of
the pattern has been encountered. This modification
will speed up the matching process in the cases to
which it is relevant, since a good match is likely to
be found sooner, and since extensive bad matches to
the "wrong" instance of the pattern will be avoided.
The same technique can be used for matching of multiple
objects which share common patterns of features.
Again, the common pattern is matched first, and then
a case analysis as to which object the pattern arose
from is performed. The technique may be applied
recursively to very large sets of target objects which have
been classified in a hierarchical manner according to a
taxonomy of common features. The taxonomy is
exploited by a matching method which performs a kind
of binary search down the hierarchy until a complete
match is found.
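
The descent through a taxonomy of shared patterns can be sketched as
follows. This is an illustration only: features are reduced to string
labels and a "match" to set inclusion, whereas the matcher described in
this paper works on edge geometry. The class TaxonNode, the function
classify, and the feature labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaxonNode:
    """A node in a hypothetical taxonomy of target objects.

    `pattern` holds the feature labels shared by every object below this
    node; a leaf additionally carries the name of a single object.
    """
    pattern: frozenset
    name: Optional[str] = None
    children: list = field(default_factory=list)


def classify(node: TaxonNode, observed: frozenset) -> Optional[str]:
    """Descend the taxonomy, matching the shared pattern at each level first."""
    if not node.pattern <= observed:
        return None                  # shared pattern absent: prune this subtree
    if not node.children:
        return node.name             # leaf reached: complete match
    for child in node.children:      # case analysis over the sub-classes
        result = classify(child, observed)
        if result is not None:
            return result
    return None


rod = TaxonNode(frozenset({"hole", "long_edge", "fork"}), name="connecting rod")
joint = TaxonNode(frozenset({"hole", "long_edge", "yoke"}), name="universal joint")
castings = TaxonNode(frozenset({"hole", "long_edge"}), children=[rod, joint])
keycap = TaxonNode(frozenset({"square_face"}), name="key-cap")
root = TaxonNode(frozenset(), children=[castings, keycap])

print(classify(root, frozenset({"hole", "long_edge", "yoke"})))
```

Because the common pattern is tested once at the shared node, a failure
there rules out every object beneath it without further work.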


More generally, the matching algorithm which we
have considered is an instance of an extremely common
kind of perception algorithm. Such algorithms are built
up from interleaved observation steps, in which some
detectable quality of the world is predicted, observed,
and used to refine the current world model, and case
analysis steps, in which assumptions are made about
the world - assumptions which are not justified by any
data or argument, but which are necessary to decide
on what observation to make next, and which can be
withdrawn later if necessary. This kind of algorithm
appears - usually in elaborated form - in many areas
of computing. Any such algorithm can be specialized
according to the general plan which we used here. To
perform the specialization, we proceed by first unwinding
the case analysis steps into a full tree of possibilities.
Then, we use the context of assumptions available at
any point in this tree to optimize the work performed
at that point.
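
The two-stage plan (unwind, then optimize each node in its context) can
be sketched generically. This is a minimal illustration under our own
assumptions, not the MacLisp implementation: the names Node, unwind and
visible_edges are hypothetical, and the "work" precomputed at each node
is deliberately trivial.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One point in the unwound search tree."""
    assumptions: tuple      # (step_name, choice) pairs on the path from the root
    precomputed: object     # work done once, at specialization time
    children: dict = field(default_factory=dict)


def unwind(case_steps, precompute, assumptions=()):
    """Unwind a list of case-analysis steps into a full tree of possibilities.

    case_steps : list of (step_name, choices) pairs, in the order they are made
    precompute : function from the assumption context to whatever can be
                 worked out in advance at that point in the tree
    """
    node = Node(assumptions, precompute(assumptions))
    if case_steps:
        (name, choices), rest = case_steps[0], case_steps[1:]
        for choice in choices:
            node.children[choice] = unwind(
                rest, precompute, assumptions + ((name, choice),))
    return node


def count_nodes(node):
    """Number of nodes in the unwound tree."""
    return 1 + sum(count_nodes(c) for c in node.children.values())


def visible_edges(context):
    """Hypothetical precomputation: edges assumed visible on this path."""
    return [name for name, choice in context if choice == "visible"]


# Hypothetical example: each step assumes a model edge is visible or hidden.
steps = [("edge_a", ("visible", "hidden")), ("edge_b", ("visible", "hidden"))]
tree = unwind(steps, visible_edges)

print(count_nodes(tree))
print(tree.children["visible"].children["hidden"].precomputed)
```

At run time the specialized program only walks this tree; everything
that depends solely on the assumptions has already been computed.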


7.  Related  Work 

The particular vision problem which we have
chosen to attack is 3D model-based vision in its strict
form: the exact shape of the object to be recognized is
assumed to be known in advance, and no restriction is
placed on the 3D orientation of the object relative to the
camera. Comparatively little work has been devoted
to this form of the vision problem. There appear to
be two traditions of work in model-based vision, one
of which might be referred to as 2D vision from exact
models, and the other as 3D vision from inexact models.
The first tradition includes work focused directly on
industrial problems such as that of Perkins [Perkins 1982],
where the exact shape of the object is specified in
advance, and where the orientation of the object is
restricted in such a way as to reduce the problem to a
"nearly" 2D form. The second tradition treats problems
in which orientation is (comparatively) unrestricted,
but where the previously available information about
the model is less complete. Examples of this kind
of work include [Garvey 1976] and [Shirai 1978]; here
the matching processes used tend to employ qualitative
rather than quantitative restrictions on matches.
Acronym [Brooks 1981] is an exception to the above,
in that it does exploit quantitative restrictions on the
parameters involved in a match. Acronym uses a
considerably more ornate matching scheme than ours. Also
it uses a method for generating numerical constraints
which is much more general and consequently much
slower than ours - by a factor of at least 100. (Acronym
does not "compile" the object model into a fast program
as we do.) Still, there are strong similarities in approach
between our work and the work on Acronym.

Thus our work is less ambitious than some previous
work in 3D model-based vision, in that we
restrict ourselves to exactly specified models, and in
that we are investigating comparatively simple
algorithms. Nonetheless, it seems to us that the problems
and processes involved in simple sequential matching of
exactly specified models are not yet well understood,
and that, as a research strategy, it makes some sense
to concentrate on this limited domain before attacking
matching problems of a more general kind.

The general strategy by which we have obtained an
efficient implementation of matching - namely special
purpose automatic programming - has been followed
in technically different form by Bolles [Bolles 1982]. Bolles
has attacked the problems of matching 2D models to
images, and more recently, 3D models to range images,
by what he calls the local feature focus method. The
method involves selecting a class of "focus" features
of similar shape on the object from which the match
is to begin. Then maximal sets of mutually consistent
interpretations for features near a given candidate
match to a focus feature are sought. Such a "maximal
clique" of consistent interpretations forms the seed for a
more complete match, which is done sequentially. (See
[Bolles 1982] for a description of the method.) This
method is compiled into a very fast matching program
by the same kinds of methods we have used. Local
features to be matched are selected in advance, and tables
of relative positions are compiled. Experiments have
shown that the maximal clique method is robust and
fast for 2D matching.

The maximal clique method does not extend easily
to the problem of 3D matching from intensity (rather
than range) images. There are two reasons for this.
First, the maximal clique method depends on the
transitivity of the consistent interpretation relation: if
interpretation A for point a is consistent with interpretation
B for point b, and if interpretation B for point b
is consistent with interpretation C for point c, then
interpretation A for point a is consistent with interpretation
C for point c. This transitivity holds for 2D, and
for 3D points from range data, but not for 2D projections
from a 3D model with arbitrary orientation. (Still,
the clique method might be used for more complicated
structures for which this transitivity does hold.) The
second and more decisive reason for the difficulty of
extending the maximal clique method to 3D is this. The
method depends on the possibility of selecting a reasonably
small set of image features which are candidates for
matching features near the already matched focus feature,
and which together uniquely identify the match.
In 2D or 3D from range data, the identification of such
a small set of features is aided by the fact that the
positions and orientations of the local features relative
to the focus feature are known in advance. In 3D
intensity images this kind of advance information is not
available, or is much weaker, so that the set of nearby
features in the image which require consideration may
be quite large. Equally importantly, the number of
possible interpretations for each such feature will be large
as well (the size of the graph of possible interpretations
in which cliques must be found is of order k x n, where
k is the number of local features considered, and n is
the average number of interpretations of each feature).

In the key-cap example, the interpretation graph would
be so large that clique finding would be impractical.
Sequential matching is less vulnerable to this kind of
problem because at each stage in the match all of the
information derived from the match so far is used to
restrict the number of candidates for match at the next
stage.
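
To give a feeling for how the interpretation graph grows, the sketch
below computes its node count, k x n, together with an upper bound on
the number of pairwise consistency relations a clique finder might have
to examine. The values of k and n are illustrative assumptions, not
figures from the key-cap experiment, and the function name is ours.

```python
from math import comb


def interpretation_graph_size(k: int, n: int):
    """Return (nodes, max_pairs) for k local features with n interpretations each.

    nodes     : one node per (feature, interpretation) pairing, i.e. k * n
    max_pairs : upper bound on the consistency edges between those nodes
    """
    nodes = k * n
    return nodes, comb(nodes, 2)


# Illustrative (assumed) values for a constrained and an unconstrained setting.
for k, n in [(10, 3), (30, 40)]:
    nodes, pairs = interpretation_graph_size(k, n)
    print(f"k={k}, n={n}: {nodes} nodes, up to {pairs} candidate pairs")
```

The pair count grows quadratically in k x n, which is one way to see why
weak advance information about nearby features makes clique finding
expensive.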

Bolles' work is very closely related to ours in
general aim and strategy; his results support the idea
that the feature matching approach to vision is feasible
and robust.

Abstract


This paper presents the system description and organization of MAPS,
the Map Assisted Photo interpretation System. MAPS is a large
integrated database system containing high resolution aerial
photographs, digitized maps and other cartographic products,
combined with detailed 3D descriptions of man-made and natural
features in the Washington D.C. area. A classification of image
database systems into three models is also presented. These models are
the Image Database (ID) model, the Map Picture Database (MPD)
model and the Image/Map Database (IMD) model.*


1. Introduction

This paper presents the system description and organization of MAPS,
the Map Assisted Photo interpretation System. MAPS is a large
integrated database system containing high resolution aerial
photographs, digitized maps and other cartographic products,
combined with detailed 3D descriptions of man-made and natural
features in the Washington D.C. area.


This paper discusses three major topics. First, a classification of
different models of database systems for cartographic applications is
presented together with a discussion of their inherent strengths and
limitations. These models are the Image Database (ID) model, the Map
Picture Database (MPD) model and the Image/Map Database (IMD)
model. Second, we argue for the utility of the Image/Map Database
model, discuss tasks and present a general description of the model.
This model describes components, facilities and techniques that should
be present in such a system, and a range of tasks that can be supported
by the model. Finally, we describe the MAPS system in terms of our
(IMD) model, and discuss three applications which utilize and integrate
image, terrain, and map data in a powerful manner. We also discuss
what we have learned during the implementation of the MAPS system,
some ideas on the proper interfaces between components, where
modularity should be achieved, and point to future work.

2. Background

Our early motivation for investigating image databases was as a
component of a complete image understanding system. We had only a
vague idea of what capabilities it should have, but we thought that it
should represent "idealized segmentations" of an image, where the
labeling of the segments was in fact the "scene interpretation". It
should relate, or compare machine generated segmentations to this
model, and provide the user with a qualitative and quantitative
performance measure of the machine segmentation. We attempted this
with the MIDAS system [1,2] using the segmentation results for a set of
Pittsburgh city scenes generated by the ARGOS [3,4] system. The results of
the performance analysis of the scene segmentation were less than
encouraging. While we could give quantitative analysis of the
segmentation and labeling by the ARGOS system, the qualitative results
were couched in the original (subjective) hand segmentations. It was
difficult to qualitatively distinguish between alternative machine
segmentations, since the relative importance (or cost function) of
missing or mislabeled regions or broken boundaries for different
regions was not represented in the segmentation. How to perform such
an evaluation is still an open research problem. Also, although we had
a database of 18 high resolution color images of Pittsburgh, we had no
general mechanism to relate one to another, except through analysis of
the hand segmentations and the names given to buildings, roads, rivers,
and other features in the scene. However, in the process of
implementing and using MIDAS we did learn a great deal about image
database organization and symbolic representation of scene
descriptions.

We decided to look at map-guided image interpretation and began to
assemble an aerial photograph database of the Washington, D.C. area.
Using this imagery, we felt, we could quickly generate a map database
that would allow us to explore image analysis of complex aerial
photographs using a simple map database that constrained where to
look, and what to look for. This idea of map-guided segmentation was
not new. The HAWKEYE system [5] and succeeding "road expert" [6,7] were
based on similar ideas, and use of world knowledge had been a well
accepted paradigm in image interpretation. However, we wanted to
focus on more general capabilities, to represent large scale spatial
organizations normally encountered in complex urban scenes. The
generation of the map database turned out to be a much harder
problem than we initially estimated, and it quickly became the focus of
our research. In retrospect, I believe, it was exactly the right problem to
work on and although there is still much to do in the area of
image/map databases, we now have the right tools and understanding
to begin to tackle the original problem. This work has direct
application in three areas:

• photo-interpretation: representation of world knowledge
for image understanding.

• situation assessment: a spatial expert for decision support
systems.

• cartography: toward digital map generation and use.

3.  Classification  of  Databases 

There has been, over the last ten years, a perceived need for
organizing and structuring image and map data for cartographic
applications. It has been difficult to compare various capabilities and
limitations of systems because there were few common denominators
by which systems could be compared. Systems reported in the
literature could loosely be categorized either as research vehicles, or
production-oriented systems for particular well defined subtasks of the
general cartographic problem [8,9,10]. Research vehicles generally had a
high degree of organizational complexity tested on very small scale
databases. Systems used in production environments tended toward
simple models running very large scale databases. Further, while the
tasks being performed involved the analysis of aerial or satellite data, it
is often unclear whether the image data was an integral part of the
resulting database, or simply used for data acquisition. One example is
the development of digital filing systems that store facts about a large
number of images without storing the actual image data. The best
example of such a system is the EROS Data Center database
maintained by the U.S. Dept. of the Interior. This database has
approximately 2x10^6 frames of Landsat imagery and 5x10^6 frames of
aircraft (aerial mapping) photography. Users may specify an area of
interest by geodetic point or rectangular area and sub-select those
frames based on time of year, cloud cover, type of sensor** and a
scene quality rating. However, the actual frames of data are stored on
high density magnetic tape. Similar situations exist in map producing
organizations such as the United States Geological Survey (USGS) and
the Defense Mapping Agency (DMA).
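
A digital filing system of this kind can be sketched as a catalogue of
per-frame attribute records that is filtered without ever touching the
pixels. The sketch below is hypothetical throughout: the record fields,
identifiers and sample values are our own and do not describe the actual
EROS Data Center schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameRecord:
    """Catalogue entry describing one frame; the image itself is archived elsewhere."""
    frame_id: str
    west: float          # geodetic bounding box, in degrees
    south: float
    east: float
    north: float
    month: int           # acquisition month, 1-12
    cloud_cover: float   # fraction of the frame obscured, 0.0-1.0
    sensor: str
    quality: int         # scene quality rating, higher is better
    tape_id: str         # archive tape holding the actual frame data


def covers(rec, lon, lat):
    """True if the geodetic point lies inside the frame's bounding box."""
    return rec.west <= lon <= rec.east and rec.south <= lat <= rec.north


def select_frames(catalogue, lon, lat, months, max_cloud, sensor, min_quality):
    """Sub-select frames over a point by season, cloud cover, sensor and quality."""
    return [r for r in catalogue
            if covers(r, lon, lat)
            and r.month in months
            and r.cloud_cover <= max_cloud
            and r.sensor == sensor
            and r.quality >= min_quality]


catalogue = [
    FrameRecord("F001", -77.5, 38.5, -76.5, 39.5, 6, 0.10, "landsat-mss", 8, "T-0417"),
    FrameRecord("F002", -77.5, 38.5, -76.5, 39.5, 12, 0.05, "landsat-mss", 9, "T-0417"),
    FrameRecord("F003", -77.5, 38.5, -76.5, 39.5, 7, 0.60, "landsat-mss", 5, "T-0533"),
    FrameRecord("F004", -80.5, 40.0, -79.5, 41.0, 6, 0.00, "landsat-mss", 9, "T-0533"),
]

hits = select_frames(catalogue, -77.04, 38.90, months={5, 6, 7, 8},
                     max_cloud=0.2, sensor="landsat-mss", min_quality=7)
print([(r.frame_id, r.tape_id) for r in hits])
```

The query returns only identifiers and tape locations, which is the
point made above: the database stores facts about images, not the image
data.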

One notable exception is described in Kondo et al. [11] where an image
database using Landsat imagery was integrated with map descriptions
for geographic, natural, and cultural features. Features can be
displayed superimposed on the image data, and imagery could be
indexed by geodetic location or by feature name. There are limitations
such as: the image-to-map correspondence was based on a fixed
decomposition of Landsat data into a latitude/longitude grid at a map
scale of 1:50000; the spatial relationships between features were entered
manually; and the overall complexity of the image and map database
was small. Nevertheless, this represents an ambitious new direction for
the development of land-use systems using Landsat imagery.

In  this  discussion  of  database  systems  for  cartographic  and  situation 
assessment  applications,  we  are  assuming  that  the  following  minimal 
capabilities  hold:  (1)  on-line  display  of  digital  imagery  and  map  data 
and  (2)  ability  to  query  interactively  about  attributes  of  the  imagery  and 
map.  The  following  is  our  classification  of  the  capabilities  of  three 
models  which  we  can  use  to  compare  various  existing  systems  or 
approaches. These models are the Image Database (ID) model, the Map
Picture Database (MPD) model and the Image/Map Database (IMD)
model.

3.1. Image Databases

The Image Database model (ID) is the simplest and most common
database model. It is organized to relate attributes about the sensed
image such as sensor-type, acquisition, cloud cover, or geodetic
coverage***. These databases generally do not represent the content of
the scene, but rather attributes of the scene. When the semantics of the
scene are present, the location of cartographic features are represented
in the image (pixel) coordinate system. This poses obvious limitations
to the application of relevant knowledge from other images or from
external sources, since there is no general mechanism to relate map
feature position between images that overlap in coverage or to an
external map. Although the features represented may appear to be
map-oriented, it is difficult to compute general geometric properties
using the image raster as the coordinate system.

Although relational database techniques have been applied to the ID
model, we feel these techniques are not appropriate to spatial database
organizations for several reasons. First, using the basic <attribute,
value> tuple to represent vector lists of map coordinate data requires
that all of the primary key attributes be duplicated in each relation,
since there is no mechanism for allowing multiple values (sets, lists,
ordered pairs) as a primitive attribute in a relation. Further, the relational
database operations such as union, intersection, join, project, are not
good primitives for implementation of inherently geometric operations
such as containment, adjacency, intersection and closest point.
Operations such as feature intersection are reduced to searching for line
segments which share the same pixel position. Finally, in any large
system, a logical partitioning of the database must be performed in
order to avoid extensive and often unnecessary search when performing
spatial operations. Partitioning is difficult to achieve in relational
systems since the relational model restricts itself to homogeneous (only
one record type) sequential sets. Previous work advocating such
organizations did not address the issues of system scale, and focused
more on issues of query languages using relational models for
geographic databases than the actual construction of complex
systems [12,13,14]. When measured by the number of images, image
based features, and by the complexity of the relationships represented,
these systems were quite simplistic.
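
To make the contrast concrete, an inherently geometric primitive such
as containment is naturally written against a vector list of
coordinates rather than against tuples in a relation. The sketch below
uses the standard ray-casting test on a polygon held as a list of
vertices; it is offered as an illustration of the kind of primitive
meant, not as the method used in any of the systems cited.

```python
def contains(polygon, point):
    """Ray-casting test: is `point` inside `polygon`?

    polygon : list of (x, y) vertices in order around the boundary
    point   : (x, y) coordinate in the same system as the vertices
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(contains(square, (2, 2)), contains(square, (5, 2)))
```

The whole boundary is a single list-valued attribute here, which is
exactly what the basic relational tuple described above does not
provide.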
*** using annotation such as the center point and corner points, not using general
image-to-map correspondence

3.2. Map Picture Databases

The Map Picture Database model (MPD) describes databases that are
generated by digitizing cartographic products, such as pre-existing maps
and charts. These databases are attractive in environments where paper
maps have played a large role in planning and analysis. There are,
however, some major limitations to spatial systems based on digitized
cartographic products. First, in the original map production, spatial
ambiguity has been rectified by the cartographer in a manner that is not
often reversible. The cartographic process involves simplification
(generalization), classification (abstraction), and symbolization of
real-world ambiguity. Constraints imposed by the scale of the map often
determine which world features can be depicted despite the desirability
of portraying a complete spatial representation. Therefore, map icon
and symbology placement may not be as accurate as the original source
material. Since the deduction of the actual spatial arrangement of
objects from an iconic representation is an open problem, MPDs
represent chaos masquerading as rationalized order. The key issue is
that MPDs are pictures of a map (however detailed) rather than the
underlying map structure and spatial organization. Although the
graphics display of MPD appears to convey a great deal of semantic
information, that impression is a result of the human observer, not a
reflection of an underlying map representation.

When a map is digitized into a map picture, another subtle
simplification occurs. The digitization process results in a map image
on a rectangular grid whose size is generally limited either by custom or
as an artifact of the digitization process. Common limitations are
scanner resolution, maximum size of image raster, and the physical size
of source map. One popular representation is to subdivide regions of
the map picture into a regular decomposition such as quad tree [15,16], or
k-d tree [17]. The implementation of this representation is greatly
simplified in MPD models since one no longer has to contend with
positional ambiguity of map features because of the cartographic
process outlined above, and the discrete nature of the digitization
process.
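
A regular decomposition of a map picture can be sketched with a small
region quadtree. The sketch assumes a square binary raster whose side
is a power of two and represents a uniform block by its value and a
mixed block by a list of four sub-blocks; these conventions and the
function names are ours, chosen for brevity.

```python
def build_quadtree(grid, x=0, y=0, size=None):
    """Recursively decompose a square binary raster (side a power of two)."""
    if size is None:
        size = len(grid)
    first = grid[y][x]
    uniform = all(grid[y + dy][x + dx] == first
                  for dy in range(size) for dx in range(size))
    if uniform:
        return first                                  # leaf: block has one value
    half = size // 2
    return [build_quadtree(grid, x, y, half),                # north-west
            build_quadtree(grid, x + half, y, half),         # north-east
            build_quadtree(grid, x, y + half, half),         # south-west
            build_quadtree(grid, x + half, y + half, half)]  # south-east


def count_leaves(node):
    """Number of uniform blocks in the decomposition."""
    if not isinstance(node, list):
        return 1
    return sum(count_leaves(child) for child in node)


grid = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
tree = build_quadtree(grid)
print(tree, count_leaves(tree))
```

The decomposition is simple precisely because every cell already has a
single definite value, which is the simplification the paragraph above
attributes to the digitization process.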

One common use for the MPD model is in geographic information
systems for land use and urban planning. In these systems, aggregate
values such as population of an area and crop yield of an area are
computed. The scale of the original map becomes the limiting factor for
accuracy in information computation. However, the grain of
computation is usually large enough that these inaccuracies are not a
practical problem. Incremental update of the database due to new
residential and industrial areas and the concomitant loss of rural areas is
a difficult problem since database update requires careful map editing
tools not usually associated with these MPD systems.

A recent trend has been to take existing MPD databases and add a
map feature database component, usually relational, to describe
attributes of various features. We believe that augmenting traditional
MPD databases with semantic information has merit in those
environments where analysis is being performed by humans, since
information synthesis is not a requirement of the database system.
However, once such a system is in place, there is a tendency to attempt
to automate analysis functions requiring spatial interpretation, and the
generation method of the MPD model has several drawbacks for use in
photo-interpretation, situation assessment, and cartography. The chief
problems are the method of generation as outlined above, the lack of
semantic information about map features, and the requirement that a
map exist at the appropriate level of detail for the area under
consideration. The IMD model discussed in the following section
addresses these issues.

3.3.  Image/Map  Databases 

The Image/Map Database model (IMD) relates map features to
image database through camera models. It therefore has the capability
to describe relationships between features acquired from different
images through the map database. This capability is in contrast to the
image database model where the feature descriptions can only be
related if the descriptions come from the same image.
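
The role of the camera model can be sketched as a pair of mappings
between geodetic and pixel coordinates, one pair per image. The affine
form used below is an assumption made for brevity and is not the camera
model used in MAPS; the class name, its fields and the sample values are
hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineCamera:
    """Toy stand-in for a camera model: maps (lon, lat) to (col, row).

    A real camera model would account for sensor geometry and terrain;
    the affine form here is an illustrative assumption.
    """
    lon0: float             # longitude of pixel column 0
    lat0: float             # latitude of pixel row 0
    px_per_deg_lon: float
    px_per_deg_lat: float

    def to_pixel(self, lon, lat):
        col = (lon - self.lon0) * self.px_per_deg_lon
        row = (self.lat0 - lat) * self.px_per_deg_lat   # rows grow southward
        return col, row

    def to_geodetic(self, col, row):
        lon = self.lon0 + col / self.px_per_deg_lon
        lat = self.lat0 - row / self.px_per_deg_lat
        return lon, lat


# Two hypothetical images of overlapping areas at different resolutions.
cam_a = AffineCamera(-77.10, 38.95, 1000.0, 1000.0)
cam_b = AffineCamera(-77.05, 38.92, 2000.0, 2000.0)

# A feature point digitized in image A is stored in geodetic coordinates ...
lon, lat = cam_a.to_geodetic(50, 40)
# ... and can then be located in image B through B's own camera model.
col_b, row_b = cam_b.to_pixel(lon, lat)
print(round(col_b, 3), round(row_b, 3))
```

Because the stored position is geodetic rather than pixel based, the
same feature description serves every image whose camera model is
known.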

Since the map database is built directly from aerial imagery in the
IMD model, the resolution / accuracy issue is a function of the ground
resolution of the imagery, the intrinsic position measurement error due
to camera model, ground control, etc. rather than an artifact of the map
depiction scale as in the MPD model. A greater variety of feature
descriptions is possible since they are not restricted to those that can be
portrayed in a cartographic product. Further, the complexity of a
particular feature description is independent of any particular task
requirement and can represent a rich set of attributes, semantic
interpretations, and knowledge from diverse sources. This flexibility is
a key element for map data representation as we look toward spatial
database systems with applications in cartographic production, expert
photo-interpretation, and situation assessment.

However, just as the cartographer must resolve ambiguity, so the
spatial database must be able to represent inconsistency in a consistent
manner. For example, errors in correspondence between images and
the geodetic model cause the same point on the earth to be given a
different geodetic position, i.e. when viewed from different images the
same geodetic point produces a different world position. If this point is
on a common boundary between two features, say a political boundary,
there should be ambiguity as to which region the point is in. By the
same token, if two large residential areas are found to intersect because
of positional uncertainty, and the result of the intersection is several
small polygonal areas, the IMD model should be able to rectify this
ambiguity. This rectification might take the form of a symbolic
relationship that indicates that the residential areas share a common
boundary, while maintaining the ability to represent the original
errorful signal data. Since the original data is maintained in the
database, the symbolic relationships do not have to be static. For
example, these relationships can be dependent on attributes similar to
those used by cartographers when they perform simplification and
generalization. The link from the symbolic interpretation back to the
original source data is not possible in MPD systems.

3.3.1. Spatial Knowledge

The IMD model gives us the tools to construct our map database from
"first principles" and tie together partial spatial knowledge at different
levels of detail. This is possible because individual map features may be
specified directly from source imagery. This capability is precluded by
the derivative nature of the MPD model. That is, it is difficult to
assimilate new and possibly errorful knowledge because of the
mismatch between the new errorful data and the cartographic
rectification of ambiguous data.

The representation of a multiple levels of detail paradigm is often
invoked as a part of a coarse-fine or hierarchical matching strategy in
image processing and interpretation. Given the scale and digitized
ground resolution of an image, the IMD model can generate a map
description that will suppress any features that would be too small to be
recognized, with remaining descriptions at the appropriate level of
detail. This technique is more than camera scaling and transformation,
since the criterion for "too small" can be an attribute of the map feature
itself. Consider the map feature description of a university campus. At
some level of detail corresponding to pixel ground resolution distance
(GRD), features such as playing fields, dormitories, institutional
buildings and offices, access roads, and campus greenery are all
individually distinguished. Using spectral properties of the features****
and spatial relationships between these features, we can determine
those feature boundaries that are likely to be muddled, and those with
sufficient detail to be recognized.
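
As a rough illustration of the level-of-detail idea above, the following
Python sketch suppresses features that would span too few pixels at a
given GRD. The `MapFeature` class, the `min_pixels` attribute, the
default threshold of three pixels, and the campus sizes are all
assumptions made for this sketch; the paper does not specify how the
"too small" criterion is stored or evaluated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MapFeature:
    name: str
    extent_m: float                     # smallest ground dimension in meters
    min_pixels: Optional[float] = None  # hypothetical per-feature criterion


def visible_features(features: List[MapFeature], grd_m: float,
                     default_min_pixels: float = 3.0) -> List[MapFeature]:
    """Return the features that span enough pixels to be recognizable."""
    kept = []
    for feature in features:
        threshold = (feature.min_pixels if feature.min_pixels is not None
                     else default_min_pixels)
        if feature.extent_m / grd_m >= threshold:
            kept.append(feature)
    return kept


# Illustrative campus features; the sizes are made up for the example.
campus = [
    MapFeature("playing field", 60.0),
    MapFeature("dormitory", 15.0),
    # A road keeps its linear character until the GRD reaches its width.
    MapFeature("access road", 6.0, min_pixels=1.0),
]

for grd in (1.2, 6.0):
    names = [f.name for f in visible_features(campus, grd)]
    print(f"GRD {grd} m: {names}")
```

Because the threshold is carried by the feature itself, the road in this
example survives at a coarser resolution than the dormitory, which is
the sense in which the technique is more than camera scaling.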

The multiple level of detail paradigm need not be applied in a
homogeneous manner. For example, tasks such as decision aids for
photo-intelligence may require high resolution detail to support
analysis, but low resolution detail to establish overall context. A large
scale spatial organization containing urban, residential, and rural areas
will require flexibility to represent the high feature density and
complexity in the urban area as well as significantly lower density in
rural areas.

Flexible knowledge acquisition is necessary because in
photo-interpretation, situation assessment, and cartography, world
knowledge is inherently fragmented. Knowledge fragmentation in these
domains arises from:

•  methods  of  knowledge  acquisition 

There are diverse sources of knowledge that are used to
acquire map feature information. Some of the most
common are direct measurement from imagery, old maps
and charts, sketches, and collateral data.

•  task  requirements 

If the task requirement is to support radar scene simulation,
then elevated roads are significant, and road networks in
general are not significant. If the task is to support map
generation at a particular scale (say 1:50000), the feature
size density may determine whether it is directly portrayed,
generalized, or omitted entirely. There are, of course, well
defined rules that govern these decisions, but they are
generally not consistent across a wide range of map scales.

**** for example: roads preserve linear properties until the GRD
approximately equals the width of the road

•  specialization in feature extraction

There is a certain amount of specialization in cartographic
and situation assessment activities. Analysts may specialize
in a particular area of the world, or be knowledgeable in
hydrology, geology, local construction customs, or political
matters. In the production of large scale maps it is rare to
find map generalists, although this may not be true for low
level feature extraction activities. This specialization tends
to fragment knowledge, and is often given as a justification
for building database systems that provide access to a wide
range of map knowledge and may have general capabilities
for knowledge synthesis.

The IMD model methodology provides a mechanism for feature
unification in a cohesive framework. It provides a framework to relate
symbolic descriptions to their original data sources. It is not tied to a
particular cartographic representation nor to limitations of cartographic
production.

4.  The  Database  Problem  in  Image  Interpretation 

The database problem has been addressed in a variety of ways in
systems that perform image analysis and interpretation. However, it
has rarely been pursued as a separate research problem. One
explanation for this is that portions of general database representation
are often embedded in the experimental image processing systems and
become highly tuned to the application. This is sometimes a result of
system performance issues, or ease of task-specific implementations,
but often it is a result of not recognizing the database problem as a
separate issue.

It is difficult to give a precise analysis of the use of map databases in
image interpretation, since the detailed organizations of experimental
systems are rarely available. However, there are several recent
examples. Work at SRI used a map database of road intersections to
construct a camera model in the HAWKEYE and subsequent "road
expert" systems17.

The ARGOS system used a digitized city plan map and elevations
for buildings to build a 3D graphics model of downtown Pittsburgh.
This model was directly compiled into a knowledge network
representation which described size, shape and relative positions of
buildings, roads, rivers, and bridges for an arbitrary view point.
Although it was not tied to a geodetic grid, it was a general map model.

Recent work at Hughes18 based on the ACRONYM system developed
by Brooks and Binford10 uses image registration to a geographic model.
The system uses pre-selected regions of interest and attempts to locate
and identify predefined object instances within these areas.

ACRONYM is currently the best example of a model based system
that incorporates viewpoint insensitive mechanisms in terms of its
model description. Its recognition process is to map edge-based image
properties to instances of object models. In the domain of aerial photo
interpretation, results have been reported for the recognition of a small
number of models (3) for wide bodied jets in aerial photographs. It is
not clear how map knowledge would be directly integrated into the
ACRONYM framework but one could speculate that it could be added
by a method similar to the work at Hughes described above.

Matsuyama has demonstrated a system for segmentation and
interpretation of color-infrared aerial photographs containing roads,
rivers, forests, and residential and agricultural areas. It uses rules to
make assignments based on region adjacency and multi-spectral
properties. These rules make use of informal map knowledge but do
not directly use a particular map to guide interpretation. It generates
good descriptions of a variety of fairly complex aerial scenes, getting a
great deal of constraint from the multi-spectral data.

In his recent thesis, Selfridge22 proposed using adaptive threshold
selection for region extraction by histogramming and region growing
using an image-based "appearance model". Although the work
describes feature positions and shapes in terms of pixel descriptions, it
is not difficult to imagine a more general map-based approach that
would result in the automatic generation of constants to his adaptive
operators.

At CMU, Herman has demonstrated the feasibility of incremental
acquisition of 3D scene descriptions from stereo-pair aerial
photographs in the MAPS database in the 3D Mosaic project. This
system requires a known stereo camera model but uses no a-priori
knowledge about the scene other than weak geometric assumptions
about urban environments.

5.  The  Image/Map  Database  Model 

In this section we discuss four classes of tasks that are common to
photo interpretation, situation assessment, and cartography. We then
list some criteria by which one can evaluate the strengths and
limitations of database systems. These criteria are not exhaustive,
rather they point to four areas that should be present in IMD
implementations and system capabilities in each of the areas.

5.1. Tasks for Image/Map Database

In this section we give a classification of tasks that are common to
applications in photo-interpretation, situation assessment, and digital
cartography systems. The four tasks are selection of image, terrain, or
map data based on attributes of the data, spatial computation of map
feature relationships, semantic computation of map features, and
synthesis of imagery, terrain and map data.

1. Selection

The selection task requires that the IMD system be able to
select from a potentially large set of database entities based
on attributes of image, terrain, and map database features.
The selection task does not require image-to-map
correspondence, and is the task normally performed by ID
model systems. For example:

•  select imagery with particular intrinsic characteristics:
sensor, scale, date, cloud cover, processing history

•  select map features based on symbolic description,
partially specified description, similarities in image
acquisition

2. Spatial Computation

Spatial computation is ubiquitous in cartographic, situation
assessment and photo-interpretation tasks. An IMD system
must provide tools to compute common spatial
relationships such as containment, closest point, adjacency,
and intersection. One issue is how to structure the
environment in order to constrain search and thereby avoid
unnecessary computation. Consider four views of the same
problem:

•  given a geodetic area, which images cover, or partially
cover this area

•  which roads can be found within the image

•  which images contain this building

•  given an image, find all images which overlap it

3. Semantic Computation

There are a number of tasks that require more than basic
spatial computation, or where the appropriate spatial
operation depends on the meaning of the map objects. Are
there intrinsic high-level properties of map features that we
can extract from basic spatial geometry that give a meaning
to the feature? Semantic computation needs to be
investigated as we develop more complex spatial databases.

For example, what is the semantics of 'intersection' for the
following pairs of map objects?

•  intersection of two roads

•  intersection of bridge and river description

•  intersection of a building and a road

4.  Synthesis 

One goal of any database system should be to bring
together diverse sources of knowledge into a common
framework. Synthesis is the generation of new information
using a new method of presentation, computation, or
analysis. For example:

•  cartographic superposition of map data on newly
acquired image

•  3D display of terrain and cultural features from map
database including man-made structures, political
boundaries, neighborhoods, arbitrary collections of
physically realized features

•  to predict spatial (location) and structural
(appearance) constraints; where to look and what to
look for based on task knowledge, previous
experience, or expectations

•  a spatial framework within which to embed
task-specific knowledge
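
The selection and spatial computation tasks listed above might be
sketched as follows. This is a minimal illustration, not the MAPS
implementation: it assumes that each image footprint can be
approximated by an axis-aligned geodetic rectangle, whereas a real
footprint derived from a camera model would generally be an arbitrary
quadrilateral. The class names, function names, and image records are
hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GeoBox:
    """Axis-aligned geodetic rectangle in decimal degrees (west < east)."""
    south: float
    west: float
    north: float
    east: float

    def intersects(self, other: "GeoBox") -> bool:
        return (self.west <= other.east and other.west <= self.east and
                self.south <= other.north and other.south <= self.north)


@dataclass(frozen=True)
class ImageRecord:
    name: str
    scale: int        # scale denominator, e.g. 12000 for 1:12000
    year: int
    footprint: GeoBox


def select_images(images: List[ImageRecord], year: Optional[int] = None,
                  max_scale: Optional[int] = None) -> List[ImageRecord]:
    """Selection: filter on intrinsic attributes only, no geometry."""
    return [im for im in images
            if (year is None or im.year == year)
            and (max_scale is None or im.scale <= max_scale)]


def images_covering(images: List[ImageRecord],
                    area: GeoBox) -> List[ImageRecord]:
    """Spatial computation: images that cover or partially cover an area."""
    return [im for im in images if im.footprint.intersects(area)]


def images_overlapping(images: List[ImageRecord],
                       target: ImageRecord) -> List[ImageRecord]:
    """Spatial computation: all other images that overlap a given image."""
    return [im for im in images
            if im.name != target.name
            and im.footprint.intersects(target.footprint)]


# Hypothetical records; names and footprints are invented for illustration.
images = [
    ImageRecord("img-a", 12000, 1976, GeoBox(38.88, -77.06, 38.91, -77.02)),
    ImageRecord("img-b", 36000, 1974, GeoBox(38.85, -77.10, 38.95, -77.00)),
    ImageRecord("img-c", 60000, 1982, GeoBox(39.20, -76.70, 39.35, -76.50)),
]
```

The linear scan over all images is the simplest possible search. The
question raised in the text, how to structure the environment to
constrain search, is exactly the question of what index should replace
that scan.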

5.2. Criteria for Image/Map Database

In this section we list some criteria that can be used to evaluate
database systems in four general areas. These areas are image-to-map
correspondence, map feature representation, spatial computation, and
database synthesis.

1. Image-to-Map Correspondence

•  can the system relate image-based features to a map
coordinate system

•  can these features be projected onto new imagery
using the correspondence mechanism

•  what capabilities exist for incrementally updating
feature descriptions based on updates to the camera
model, or to intrinsic changes to the feature itself

2.  Representation 

•  what are the capabilities for feature representation;
what complex spatial relationships can it represent; how
is inconsistency recognized and handled

•  can the user describe features and associated
attributes in a flexible manner; what is the variety of
attributes

•  can the representation accommodate map-based
information coming from a variety of non-imagery
sources

•  what is the relationship between the representation of
signal and symbolic data

•  what synthesis tasks does the representation support

3. Spatial Computation

•  does  the  system  support  dynamic  spatial  queries 

•  what spatial relationships does the system compute
directly from the underlying data, which relationships
are specified by the user, how do they interact, how
does one maintain consistency

•  what mechanisms are available to partition the search
space when computing spatial relationships


4. Database Synthesis

•  imagery, terrain and map data are components, each
with an appropriate representation, operation
semantics, and utility: in what ways does the database
support synthesis of these components

•  what  concrete  tasks  requiring  synthesis  are  performed 

6.  MAPS  Overview 

In the previous sections we have attempted to raise issues of
Image/Map Database organization, tasks and capabilities. In this
section we will discuss the MAPS system components and their
capabilities. We will only briefly describe those aspects that have been
reported on in other papers. Our latest work in the area of hierarchical
organization, decomposition, and search is reported beginning in
Section 6.6. New work in map feature semantics is discussed in
Section 6.7. For a more detailed description of the image segmentation
program (Section 6.1.2) and the image-to-map correspondence program
(Section 6.3) see McKeown24. For a detailed description of the
CONCEPTMAP database see McKeown. Appendix I contains a nearly
complete list of the programs associated with each system component.

6.1. BROWSE: Interactive Image/Map Display

BROWSE26 is an interactive window-based image display system. It
provides a common interface to all of the MAPS system components to
display results of queries, graphical prompts for interactive
image-to-map correspondence, superimposition of map data on imagery,
and other similar functions. While often viewed as an application issue,
a flexible, functional user interface is critical for building more complex
tools. BROWSE provides the user with a window-oriented interface,
which greatly increases the effective spatial resolution of the
frame-buffer, and provides multiple processing contexts which allow
users to manipulate dynamically the size, level of detail, and visibility of
imagery.

6.1.1. Window-based Display

We have applied and extended the bit-map window paradigm to
handle high resolution, multi-bit per pixel digitized images. However,
due to nearly an order of magnitude difference in the amount of data
needed to perform screen updates and due to processing limitations
found in most frame buffer architectures, many of the solutions used
for single bit per pixel displays are not suitable for direct
implementation. A detailed discussion of the design and organization
of the window manager appears in McKeown & Denlinger26.

Besides the display of imagery, we have found the window
representation to be useful as a communication mechanism between
MAPS components, to invoke image processing programs, and to
retrieve and display the results of such processing. All MAPS
components (see Appendix I) that display imagery, map data or
graphics use the BROWSE window mechanism for display and
communication. For example, the interactive image correspondence
program, CORRES, uses the window mechanism to automatically display
landmark image fragments and to create a high resolution window
containing the approximate position of the landmark ground control
point to cue the user. PICPAC contains a collection of image processing
routines that can be invoked on BROWSE windows simply by specifying
the window name. BROWSE routines use the window name to
determine the image name, resolution, and rectangular image bounds.
This information, along with parameters specific to the particular
processing operation, is passed to the image processing routine. The
results of the operation can be displayed in a new window.
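
The use of a window name as the handle for invoking a processing
routine might be sketched as below. This is only an illustration of the
communication pattern described above: the `Window` and
`WindowManager` classes, the `invoke` method, and the `threshold`
stand-in routine are assumptions for the sketch and do not reproduce
the BROWSE interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Bounds = Tuple[int, int, int, int]  # (row0, col0, row1, col1) in image pixels


@dataclass
class Window:
    name: str
    image_name: str
    resolution: int   # reduction factor; 1 means full resolution
    bounds: Bounds


class WindowManager:
    def __init__(self) -> None:
        self._windows: Dict[str, Window] = {}

    def create(self, window: Window) -> Window:
        self._windows[window.name] = window
        return window

    def lookup(self, name: str) -> Window:
        return self._windows[name]

    def invoke(self, routine: Callable[..., str], window_name: str,
               result_name: str, **params) -> Window:
        """Resolve the window name, run the routine, show the result."""
        src = self.lookup(window_name)
        result_image = routine(src.image_name, src.resolution,
                               src.bounds, **params)
        return self.create(Window(result_name, result_image,
                                  src.resolution, src.bounds))


def threshold(image_name: str, resolution: int, bounds: Bounds,
              level: int = 128) -> str:
    """Stand-in for an image operation: returns the derived image's name."""
    return f"{image_name}.thresh{level}"
```

The caller supplies only a window name and operation-specific
parameters. Everything else the routine needs is recovered from the
window record, which is what lets the window act as a communication
mechanism between otherwise independent programs.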

6.1.2. Interactive Image Segmentation

SEGMENT is an interactive image segmentation program which uses
the BROWSE window facility to provide an interface to our frame buffer.
Users can extract image-based descriptions of map features, edit
existing features, and assign symbolic names to the features. SEGMENT
produces a standard format [SEG] file that is used throughout the MAPS
database to represent image-based descriptions of point, line, and
polygon geometric data. Database routines discussed in Section 6.5 are
available to convert the [SEG] description to a map-based description
[D3].

6.2. Image Database

The MAPS system currently contains approximately 100 digitized
images, most of which are low altitude aerial mapping photographs.
Typical ground resolution distances (GRD) are 120cm², 360cm², and
600cm² per pixel. The imagery is mainly comprised of three data sets
taken in 1974, 1976 and 1982. In addition to aerial mapping
photographs, we have several digitized maps including a USGS
topographic map, and tour guide maps. Figure 1 gives the current
status of the MAPS Washington D.C. image database. Although we
have several Landsat, Skylab and high altitude aerial photographs taken
over the Washington D.C. area, we have focused our work on those
images that provide the greatest ground detail.


CLASS    NUMBER   IMAGE SCALE   DATABASE RASTER   COMMENTS

ASC'74   25       1:36000       2048x2048x8       Aerial mapping BW
WGL'76   37       1:12000       2200x2200x8       Aerial mapping BW
AER'79   2 1      1:124000      2288x2288x8       Color infrared
ASC'82   29       1:60000       2300x2300x8       Aerial mapping BW
MAP'71   1        1:24000       4096x4096x8       USGS topo map
MAP'74   1 1      1:10000*      4096x3080x8       D.C. region map
MAP'79   *        1:10000*      4096x4096x8       Tourist guide map

* not cartographically accurate.

Figure 1: MAPS: Image Database Component


6.2.1. Generic Image to File Mapping

The MAPS system uses a generic naming convention to refer to
images in the database. The generic name is a unique identifier
assigned to the image when it is integrated into the database. For
example, IX38M7, DCI4’0 are representative generic names that
correspond to flight line annotation on the photographic film. All types
of image access that require the filesystem name of the image, or
require associated image database files, use the generic name
mechanism to construct the appropriate physical file name. It is
possible to change the logical and/or physical location of imagery, or to
add another image to the database, by updating the generic name file.
As we move to larger image/map systems this naming isolation allows
us to construct a database that can be distributed over multiple

The decoupling of name with physical or logical location fits well with
name server organizations usually employed with such distributed
systems.
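
The generic name mechanism can be pictured as a small lookup table
that is consulted whenever a physical file name is needed. The sketch
below is an assumption about how such a table could work, not a
description of the MAPS generic name file: the `GenericNameTable`
class, the directory layout, the file extensions, and the reduced
resolution suffix are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict


class GenericNameTable:
    """Maps a generic image name to the directory that holds its files."""

    def __init__(self) -> None:
        self._roots: Dict[str, PurePosixPath] = {}

    def register(self, generic_name: str, directory: str) -> None:
        # Adding an image and relocating one are the same operation.
        self._roots[generic_name] = PurePosixPath(directory)

    def path(self, generic_name: str, kind: str = "img",
             reduction: int = 1) -> str:
        """Construct the physical file name for one associated file."""
        root = self._roots[generic_name]
        suffix = f".r{reduction}" if reduction > 1 else ""
        return str(root / f"{generic_name}{suffix}.{kind}")


table = GenericNameTable()
table.register("img0001", "/vol1/images")
print(table.path("img0001"))           # full resolution image
print(table.path("img0001", "sdf"))    # an associated database file

# Moving the image only requires updating the table entry.
table.register("img0001", "/vol2/archive")
print(table.path("img0001", reduction=4))
```

Programs that hold only the generic name are unaffected by the move,
which is the naming isolation the text relies on.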

The following table lists the database files associated with each
image in the MAPS database. Each is accessible using the generic
name.

•  |til si lot | image-to-file system mapping

-  contains the file system location of the database image

-  identifies which reduced resolution images are computed
and available for hierarchical display

•  [SDF] scene description file

-  contains image specific information: source, date, time of
day, raster size, digitization, image scale, geodetic corner
points, camera information

•  [COF] image-to-map coefficients file

-  contains camera model coefficients, error model,
polynomial orders solved, best correspondence (default
polynomial order)

-  independent coefficients for <latitude>, <longitude>,
<image row>, <image column>

•  [COR] correspondence pairs file

-  mapping of ground control points to image point
specification

-  lists of landmark names and their geodetic position
combined with image pixel position of landmark specified
by user

•  [HYP] hypothesized landmark file

-  lists of landmark names which are within the image
geodetic coverage, but were not used to perform image-map
correspondence
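
The coefficients file described above holds independent polynomial
coefficients for latitude, longitude, image row, and image column. A
minimal sketch of how such a polynomial camera model could be
evaluated is given below. The term ordering, the `CameraModel` class,
and the first-order coefficient values are assumptions chosen for the
illustration; the paper does not give the file layout or the fitting
procedure.

```python
from typing import Dict, List, Sequence, Tuple


def poly_terms(x: float, y: float, order: int) -> List[float]:
    """Monomials x**i * y**j with i + j <= order, in a fixed order."""
    return [x ** i * y ** j
            for i in range(order + 1)
            for j in range(order + 1 - i)]


def evaluate(coeffs: Sequence[float], x: float, y: float,
             order: int) -> float:
    terms = poly_terms(x, y, order)
    if len(coeffs) != len(terms):
        raise ValueError("coefficient count does not match polynomial order")
    return sum(c * t for c, t in zip(coeffs, terms))


class CameraModel:
    """Four independent bivariate polynomials of the same order."""

    def __init__(self, order: int,
                 coeffs: Dict[str, Sequence[float]]) -> None:
        self.order = order
        self.coeffs = coeffs

    def image_to_map(self, row: float, col: float) -> Tuple[float, float]:
        return (evaluate(self.coeffs["latitude"], row, col, self.order),
                evaluate(self.coeffs["longitude"], row, col, self.order))

    def map_to_image(self, lat: float, lon: float) -> Tuple[float, float]:
        return (evaluate(self.coeffs["row"], lat, lon, self.order),
                evaluate(self.coeffs["column"], lat, lon, self.order))


# First-order example; the term order is [1, y, x]. Values are invented.
model = CameraModel(1, {
    "latitude":  [39.0, 0.0, -1e-5],   # lat = 39.0 - 1e-5 * row
    "longitude": [-77.1, 1e-5, 0.0],   # lon = -77.1 + 1e-5 * col
    "row":       [3.9e6, 0.0, -1e5],   # row = 3.9e6 - 1e5 * lat
    "column":    [7.71e6, 1e5, 0.0],   # col = 7.71e6 + 1e5 * lon
})
```

Keeping the forward and inverse mappings as separately stored
polynomials avoids inverting the model at query time, at the cost of
the two directions agreeing only to within the fitting error.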

6.2.2. Image-Based Segmentations

MAPS maintains several types of image segmentations and map
overlay descriptions associated with each image in the database. These
segmentations either are feature descriptions generated using the image
as the base coordinate system, or the projection of map features onto
the image using map-to-image correspondence, or segmentations from
other images registered to the image. In the latter case, image-to-map
correspondence is used to register the two images. Users can point to
segmentation overlay features using the display interface in BROWSE
and CONCEPTMAP, identify the segmentation feature name and retrieve
its image and geodetic coordinates. For the [DLMSSEG] and
[CONCEPTSEG] segmentation descriptions, the name of the segmentation
feature is used to retrieve the associated DFAD (see Section 6.4) or
CONCEPTMAP description. The following table is a list of image
segmentations associated with each image in the database.
Segmentations that require map correspondence for their generation
can be automatically recreated when the image camera model is updated.

•  [HANDSEG] hand (human) segmentation

-  collection of all hand segmentations performed on this
image

•  [HCOMPSEG] composite hand segmentation

-  collection of all features in the [HANDSEG] database that
are spatially contained in this image

•  [MACHSEG] machine segmentation

-  collection of all machine segmentations performed using
the image

•  [MCOMPSEG] composite machine segmentation

-  collection of all features in the [MACHSEG] database that
are spatially contained in the image

•  [DLMSSEG] DLMS map overlay

-  all features from the DLMS digital feature analysis database
that are spatially contained in the image

•  [CONCEPTSEG] CONCEPTMAP map overlay

-  all features from the CONCEPTMAP database that are
spatially contained in the image

•  [COVERSEG] image coverage overlay

-  all images whose area of coverage is overlapped or wholly
contained within the image

6.3. Image-to-Map Correspondence

The MAPS system uses an interactive image-to-map correspondence
procedure to place new imagery into correspondence with the map
database. It has three major components: a landmark database, a
landmark creation and editing program, and an interactive
correspondence program. The process of landmark selection,
description, and interactive correspondence has been described in detail
in McKeown24.

6.3.1. Landmark Database

MAPS maintains a database of approximately H00 geodetic ground
control points in the Washington D.C. area. Landmarks are acquired
using USGS topographic maps, but in principle can be integrated from
any source that provides accurate geodetic position
<latitude/longitude/elevation>. Users can query the database to find
landmarks by name, within a geodetic area, or the closest landmark to a
geodetic point. Landmark features are also integrated into the
CONCEPTMAP database and can be found using the <role-derivation>
attribute (see Section 6.5.2) of a concept role schema.
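
The three landmark queries named above (by name, within a geodetic
area, and closest to a geodetic point) could be sketched as follows.
This is an illustrative assumption rather than the MAPS landmark
database interface: the `Landmark` and `LandmarkDB` classes are
hypothetical, the landmark names and positions are invented, and
distance is computed with the haversine formula on a spherical earth.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Landmark:
    name: str
    lat: float      # decimal degrees, north positive
    lon: float      # decimal degrees, east positive
    elev_m: float


def ground_distance_m(lat1: float, lon1: float, lat2: float, lon2: float,
                      radius_m: float = 6371000.0) -> float:
    """Great-circle distance by the haversine formula (spherical earth)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


class LandmarkDB:
    def __init__(self, landmarks: Iterable[Landmark]) -> None:
        self._landmarks = list(landmarks)

    def by_name(self, fragment: str) -> List[Landmark]:
        """Partial, case-insensitive name match."""
        key = fragment.lower()
        return [lm for lm in self._landmarks if key in lm.name.lower()]

    def within(self, south: float, west: float,
               north: float, east: float) -> List[Landmark]:
        return [lm for lm in self._landmarks
                if south <= lm.lat <= north and west <= lm.lon <= east]

    def closest(self, lat: float, lon: float) -> Optional[Landmark]:
        if not self._landmarks:
            return None
        return min(self._landmarks,
                   key=lambda lm: ground_distance_m(lat, lon, lm.lat, lm.lon))


# Invented landmarks for illustration only.
db = LandmarkDB([
    Landmark("north marker", 38.95, -77.05, 80.0),
    Landmark("river marker", 38.88, -77.04, 5.0),
    Landmark("east marker", 38.90, -76.95, 20.0),
])
```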

6.3.2. LANDMARK

LANDMARK is an interactive tool used to generate new landmarks,
their text descriptions, and associated image fragments. The following
information is maintained by LANDMARK to support landmark database
access.

•  [LDM] landmark name directory

-  associates the list of landmark names with their geodetic
position

-  sorted for spatial proximity

-  partial name matching also provided

•  |l h) landmark text description

-  contains a detailed text description of the location of the
landmark and general factual properties of the landmark

-  stores the location and name of the associated image
fragment file [LIMG], and replicates the geodetic position
from the [LDM] file

•  [LIMG] landmark image fragment

-  contains a high-resolution image fragment which clearly
shows the ground control point and scene context around
the point

6.3.3. CORRES

CORRES is an interactive image-to-map correspondence program. It
uses the BROWSE window interface, the LANDMARK database, and
image database routines to interactively build an image-to-map
correspondence. Once an initial guess of the corner points is performed
and the [COR] and [COF] files have been created in the image database,
CORRES automatically suggests new possible landmark points using the
image database [HYP] files. The LANDMARK database [LIMG] files are
used to display the ground control point when the user selects it from
the list of hypothesized points.
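
The hypothesized landmark list can be thought of as the landmarks that
fall inside the image's geodetic coverage but have not yet been used in
the correspondence. The sketch below states that idea directly. It
assumes a rectangular coverage area taken from the initial corner point
guess; the function name, argument layout, and landmark data are
hypothetical and are not taken from CORRES.

```python
from typing import Dict, Iterable, List, Tuple


def hypothesized_landmarks(landmarks: Dict[str, Tuple[float, float]],
                           coverage: Tuple[float, float, float, float],
                           used: Iterable[str]) -> List[str]:
    """Names of landmarks inside the coverage that were not yet used.

    landmarks maps a name to (latitude, longitude);
    coverage is (south, west, north, east) in decimal degrees.
    """
    south, west, north, east = coverage
    used_names = set(used)
    return sorted(name for name, (lat, lon) in landmarks.items()
                  if south <= lat <= north and west <= lon <= east
                  and name not in used_names)


# Invented landmark positions for illustration only.
landmarks = {
    "north marker": (38.95, -77.05),
    "river marker": (38.88, -77.04),
    "bridge marker": (38.87, -77.03),
    "east marker": (38.90, -76.95),
}
coverage = (38.85, -77.10, 38.92, -77.00)
print(hypothesized_landmarks(landmarks, coverage, used=["river marker"]))
```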

6.4. DLMS: An External Database

The ability to rendezvous with externally generated map databases is
a key capability in order to integrate information from a variety of
sources. One example of the flexibility of the MAPS database is
illustrated by our experiences with the Defense Mapping Agency's
(DMA) Digital Landmass Simulation System (DLMS)29.

DLMS is composed of a digital feature analysis database (DFAD)
which describes man-made cultural features and a digital terrain
elevation database (DTED) which is organized as a raster elevation grid.
The specified resolution of the DFAD data is comparable to map scales
of 1:250,000 to 1:100,000. The specified resolution of DTED data is
within a meter vertical resolution over a 100 meter (3 arc sec) grid.
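
As a quick check on the grid spacing quoted above, the following sketch
converts an angular spacing to an approximate ground distance. It
assumes a spherical earth with a mean radius of 6371 km, which is a
simplification and not a statement about how DTED is defined; the
function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean radius; spherical approximation


def arcsec_to_meters(arcsec: float, latitude_deg: float = 0.0,
                     east_west: bool = False) -> float:
    """Ground length of an angular spacing given in arc seconds.

    North-south spacing is independent of latitude on a sphere;
    east-west spacing shrinks with the cosine of the latitude.
    """
    meters = EARTH_RADIUS_M * math.radians(arcsec / 3600.0)
    if east_west:
        meters *= math.cos(math.radians(latitude_deg))
    return meters


print(round(arcsec_to_meters(3.0), 1))                        # north-south
print(round(arcsec_to_meters(3.0, 39.0, east_west=True), 1))  # east-west
```

Under these assumptions a 3 arc second post spacing is a little over
90 meters north-south, consistent with the nominal 100 meter figure,
and noticeably smaller east-west at the latitude of Washington D.C.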

6.4.1.  dfad:  Digital  Feature  Analysis  Database 

In order to integrate the DFAD database into MAPS, we reorganized
the internal DFAD data structures to allow for random access using a
feature header list. We converted the representation of geodetic
coordinates from an offset format that was relative to an internal base
coordinate to an absolute coordinate system. Our DFAD database
covers a two degree square area, from latitude N 38° to N 40° and
longitude W 76° to W 78°. It is composed of 64 "map sheets", each
containing a 15' x 15' map area. We assigned unique feature identifiers
(names) to map features because feature numbers were not unique
across map sheets. There are no feature names or semantics associated
Figure 2: DLMS: Polygon Database For Washington D.C. Area


Figure 4: MAPS: CONCEPTMAP Database for Figure 3


Figure 3: DLMS: Detail of Northwest Washington Area


with DFAD entries primarily because the database was not intended to
be used as a general purpose geographic information system. The
feature header mechanism allows us to perform random access to
features in a map sheet. We can also search using feature attributes such
as feature analysis code, feature type, surface material code, and feature
id code. This type of reorganization is necessary to support an
interactive query-based interface for human and application programs.
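As an illustration only, the feature header mechanism can be pictured as an index from feature number to a byte offset in the map sheet, plus a filter over header attributes. The sketch below is written under stated assumptions: the `FeatureHeader` fields, the fixed 16-byte records, and the helper names are hypothetical and do not reproduce the DFAD format or the MAPS code.

```python
import io
from dataclasses import dataclass


@dataclass
class FeatureHeader:
    number: int         # feature number within the map sheet
    seek: int           # byte offset of the record (hypothetical layout)
    analysis_code: int  # feature analysis code
    feature_type: str   # 'areal', 'linear', or 'point'
    material_code: int  # surface material code
    id_code: int        # feature id code


def read_feature(sheet, header, length):
    """Random access: jump to the record instead of scanning the sheet."""
    sheet.seek(header.seek)
    return sheet.read(length)


def search(headers, **attrs):
    """Return the headers whose attributes equal every keyword given."""
    return [h for h in headers
            if all(getattr(h, k) == v for k, v in attrs.items())]


# Toy map sheet with three fixed-length (16 byte) records; offsets are invented.
headers = [
    FeatureHeader(471, 0, 1082, 'areal', 6, 909),
    FeatureHeader(474, 16, 1085, 'linear', 3, 250),
    FeatureHeader(402, 32, 1010, 'areal', 3, 10),
]
sheet = io.BytesIO(b'tidal-basin-----bridge----------govt-buildings--')

water = search(headers, material_code=6)
record = read_feature(sheet, headers[1], 16)
```

Keeping the headers separate from the feature records is what makes both operations cheap: attribute search touches only the header list, and retrieval is a single seek.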

Figure 2 shows a plot of polygon features in the area corresponding
to our entire Washington D.C. database. Figure 3 is a detailed portion
of the DFAD database centered on Foggy Bottom. For comparison,
Figure 4 is the corresponding area from the CONCEPTMAP database
plotted on the same scale.

Some of the DFAD database entries are easily recognizable as natural
or man-made features, although as discussed, this information is not in
the original database itself. Figure 5 is the description for the Tidal
Basin, Figure 6 is the Rochambeau Bridge. Figure 7 is a description for
a large irregular area in central Washington D.C. that contains the
major government office buildings. The feature name assigned by MAPS
is the first entry in each of the Figures.

feature 'd?5f471a909'
feature header: 471 (seek: 72416)
feature analysis code: 1082
feature type: areal feature
surface material code: (6) water
feature id code: (909) not assigned
subcategory: fresh water (shallow)
average height (meters): 0

aerial feature: 471 polygon with 76 vertices
tree cover: 0 roof cover: 0 density: 0
min point (south west): 5298,7979
max point (north east): 5067,8385

Figure 5: DFAD: Description for Tidal Basin

feature 'd?5f474l250'
feature header: 474 (seek: 73132)
feature analysis code: 1085
feature type: linear feature
surface material code: (3) stone / brick
feature id code: (250) not assigned
subcategory: not assigned (general)
average height (meters): 2

linear feature: 474 line with 3 vertices
width: 24 reflectivity: 2
first point: 5024,8064
last point: 5192,8227

Figure 6: DFAD: Description for Rochambeau Bridge

feature 'd?5f402a010'
feature header: 402 (seek: 63688)
feature analysis code: 1010
feature type: areal feature
surface material code: (3) stone / brick
feature id code: (010) not assigned
subcategory: institutional (general)
average height (meters): 28

aerial feature: 402 polygon with ?1 vertices
tree cover: 10 roof cover: 70 density: 3
min point (south west): 5705,7971
max point (north east): 6260,8/99

Figure 7: DFAD: Description for Government Buildings


6.4.2. DTED: Terrain Elevation Database

The organization of the digital terrain database is more
straightforward. The DTED database covers the same geodetic area as
our DFAD data. It is organized into 64 raster images using the same
image format as our digital aerial imagery. Each image contains a 15'
x 15' array of terrain samples, where each "pixel" is a discrete elevation
point. The terrain package, ELEVATION, provides a transparent
interface to the DTED database. Users can retrieve elevation
information based on rectangular geodetic area, closest sample point to
a geodetic point, or by weighted interpolation. ELEVATION uses the
CMU image package to efficiently buffer blocks of contiguous terrain
data.
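The three retrieval modes can be sketched on a small raster as follows. This is a minimal illustration, not the ELEVATION package: the function names are hypothetical, positions are given as fractional grid indices rather than geodetic coordinates, and bilinear weighting is assumed for the "weighted interpolation", which the text does not specify.

```python
import math


def window(grid, r_min, r_max, c_min, c_max):
    """Rectangular retrieval: all samples inside an inclusive index window."""
    return [row[c_min:c_max + 1] for row in grid[r_min:r_max + 1]]


def nearest_sample(grid, row, col):
    """Closest sample point to a fractional (row, col) position."""
    r = min(max(int(round(row)), 0), len(grid) - 1)
    c = min(max(int(round(col)), 0), len(grid[0]) - 1)
    return grid[r][c]


def bilinear(grid, row, col):
    """Weighted interpolation from the four surrounding samples
    (assumes a grid of at least 2 x 2 samples)."""
    r0 = min(max(int(math.floor(row)), 0), len(grid) - 2)
    c0 = min(max(int(math.floor(col)), 0), len(grid[0]) - 2)
    fr, fc = row - r0, col - c0
    top = grid[r0][c0] * (1 - fc) + grid[r0][c0 + 1] * fc
    bottom = grid[r0 + 1][c0] * (1 - fc) + grid[r0 + 1][c0 + 1] * fc
    return top * (1 - fr) + bottom * fr


# Toy elevation raster in meters (invented values).
grid = [[10.0, 20.0, 30.0],
        [20.0, 30.0, 40.0],
        [30.0, 40.0, 50.0]]
```

A real interface would first convert latitude and longitude to a grid position using the sheet origin and the 3 arc second sample spacing; that step is omitted here.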

6.5.  Conceptual  Map  Database 

The map database component of MAPS, CONCEPTMAP, has been
described in McKeown. We will give a brief overview of the
organization and concentrate on our new work in hierarchical
organization and feature semantics.

6.5.1.  Concept  Schema 

The basic entity in the CONCEPTMAP database is the concept schema.
The schema is given a unique ID by the database, and the user specifies
a 'symbolic' print name for the concept. Each concept may have one or
more role schema associated with it. Role schema specify one or more
database views of the same geographic concept. For example,
'northwest washington' can be viewed as a residential area as well as a
political entity. Another aspect is the ability to associate the same name
to two different but related spatial objects. Consider the 'kennedy
center' as a building and as the spatial area (ie. lawn, parking area, etc.)
encompassing the building. The principal role of a concept schema
indicates a preferred or default view. The CONCEPTMAP database is
composed of lists of concept schema.

6.5.2. Role Schema

The role schema is a further specification of the attributes of the map
feature. It contains the role name attribute (building, bridge,
commercial area, etc.), a subrole name attribute (house, museum,
dormitory, etc.), a role class attribute (ie. buildings may be government,
residential, commercial, etc.), a role type attribute (ie. physical,
conceptual or aggregate), and a role derivation attribute (ie. derivation
method).

The role name, subrole, and role class attributes categorize the map
feature according to its function. For example: this feature is a
building, used as an office building, used for government purposes.

The role type attribute describes whether the map feature is physically
realized in the scene, or if it is a conceptual feature such as a
neighborhood, political, or geographic boundary. The role type
attribute also provides a mechanism to define the role schema as a
collection of physical or conceptual map features. For example, the
concept schema in MAPS for 'district of columbia' has a role type
'aggregate conceptual', with aggregate roles 'northwest washington',
'northeast washington', 'southwest washington', and 'southeast
washington'. This mechanism allows the user to explicitly represent
concepts that are strictly composed of other role schema. The role
derivation attribute describes the method by which the role and its
associated geodetic position description were added to the
CONCEPTMAP database.
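One way to picture the concept and role schema is as two record types, with a concept holding a list of roles and pointing at its principal role. The sketch below is an assumption-laden illustration: the class and field names, the numeric IDs, and the example role names ('political area', 'residential area') are hypothetical and are not taken from the CONCEPTMAP implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RoleSchema:
    role_name: str                          # e.g. building, bridge
    subrole_name: Optional[str] = None      # e.g. house, museum
    role_class: Optional[str] = None        # e.g. government, commercial
    role_type: str = 'physical'             # physical, conceptual, aggregate
    role_derivation: Optional[str] = None   # how the role was added
    aggregate_roles: List[str] = field(default_factory=list)


@dataclass
class ConceptSchema:
    concept_id: int                         # unique ID given by the database
    print_name: str                         # symbolic name given by the user
    roles: List[RoleSchema] = field(default_factory=list)
    principal: int = 0                      # index of the default view

    def principal_role(self) -> RoleSchema:
        return self.roles[self.principal]


dc = ConceptSchema(1, 'district of columbia', [
    RoleSchema('political area', role_type='aggregate conceptual',
               aggregate_roles=['northwest washington', 'northeast washington',
                                'southwest washington', 'southeast washington']),
])
nw = ConceptSchema(2, 'northwest washington', [
    RoleSchema('residential area', role_type='conceptual'),
    RoleSchema('political area', role_type='conceptual'),
])
```

Here 'northwest washington' carries two roles, matching the residential and political views described above, while 'district of columbia' is represented purely as an aggregate of other roles.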

Each role schema contains an identifier that is used to access a set
of CONCEPTMAP database files which contain geodetic information
about the map feature. These identifiers can be shared when multiple
roles have the same geodetic description, as in the previous example of
'northwest washington' viewed as both a residential and political area.

The CONCEPTMAP 3D description allows for point, line, and polygon
features as primitives, and permits the aggregation of primitives into
more complex topologies, such as regions with holes, discontinuous lines,
and point lists. Associated with each feature that was acquired from an
image in the database is the generic name of the image. If the
correspondence of the generic image changes due to the addition of
more ground control points, or a better camera model, the position of
the ground feature can be automatically recalculated.

The following is the set of files associated with each identifier:

• |t»| 3D geodetic location
  - a set of <latitude/longitude/elevation> triples which
    define the geodetic position of the role

• (H'l | 3D feature shape description
  - metric values for length, width, area, compactness,
    centroid, Fourier shape approximation etc.

• [IC] feature image coverage
  - a list of generic images which contain this feature
  - image mbr and feature coordinates for each image

• [PROP] feature property list
  - list of properties of the map feature
  - some general properties such as 'age', 'capacity', '3D
    display type'
  - feature type specific properties such as 'number of floors',
    'basement', 'height', and 'roof type' for buildings

6.5.3. Database Query

CONCEPTMAP supports four methods of database query. The
methods are signal access, symbolic access, template matching and
geometric access. The following table gives a brief description of each
query method.

• signal access
  Given a geodetic specification (point, line, area),
  perform the following operations:
  - display all imagery which contains point, line or area
  - retrieve all map features within geodetic specification
  - retrieve terrain elevation

• symbolic access
  Given a symbolic name, such as 'treasury building', perform
  the following operations:
  - convert name into geodetic specification to perform signal
    access operations listed above
  - retrieve database description, facts and properties of the
    map feature
  - retrieve imagery based on symbolic (generic) name

• template matching
  Given a partial specification of symbolic attributes perform
  the following operations:
  - find all map features which satisfy the specification
    template and return their symbolic name
  - find all images and return symbolic (generic) name

• geometric access
  Given a geometric operation such as 'contains' and a
  geodetic specification perform the following operations:
  - find all map features which satisfy the operation
    performed over the geodetic specification and return their
    symbolic name
  - find all image features and return symbolic name

These primitive access functions can be combined[25] to answer
queries such as 'display images of foggy bottom before 1977', 'what is
the closest commercial building to this geographic point', and 'how
many bridges cross between virginia and the district of columbia'.
Figure 8 is a simple schematic giving the processes by which MAPS
provides signal and symbolic access into the CONCEPTMAP database and
display of the query result.
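A query such as 'what is the closest commercial building to this geographic point' can be read as template matching followed by a geometric test. The following sketch shows that composition under stated assumptions: the feature records, their names and positions are invented for illustration, each feature is reduced to a single point, and distance is a flat approximation in degrees rather than a geodetic computation.

```python
import math

# Hypothetical feature records; names and positions are invented.
FEATURES = [
    {'name': 'office block a', 'role_name': 'building',
     'role_class': 'commercial', 'position': (38.900, -77.050)},
    {'name': 'office block b', 'role_name': 'building',
     'role_class': 'commercial', 'position': (38.910, -77.030)},
    {'name': 'agency building', 'role_name': 'building',
     'role_class': 'government', 'position': (38.901, -77.049)},
    {'name': 'river bridge', 'role_name': 'bridge',
     'role_class': None, 'position': (38.890, -77.060)},
]


def template_match(features, **template):
    """Template matching: keep features agreeing with every given attribute."""
    return [f for f in features
            if all(f.get(k) == v for k, v in template.items())]


def closest(features, point):
    """Geometric access: the feature nearest to a (lat, lon) point."""
    def dist(f):
        dlat = f['position'][0] - point[0]
        dlon = (f['position'][1] - point[1]) * math.cos(math.radians(point[0]))
        return math.hypot(dlat, dlon)
    return min(features, key=dist)


def closest_matching(features, point, **template):
    """Combine the two primitives; returns a symbolic name or None."""
    candidates = template_match(features, **template)
    return closest(candidates, point)['name'] if candidates else None


query_point = (38.9008, -77.0492)
answer = closest_matching(FEATURES, query_point,
                          role_name='building', role_class='commercial')
```

Filtering by template first keeps the geometric step small, which is the same motivation given for the hierarchical search later in the paper.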

6.5.4.  Spatial  Computation 

CONCEPTMAP computes geometric properties based on the geodetic
descriptions associated with each role schema in the database. A static
description of all spatial relationships between map features for
contains, subsumed by, intersection, adjacency, closest point, and
partitioned by is maintained in the database.

• 'contains'
  - an unordered list of features which the map feature
    contains

• 'subsumed by'
  - an unordered list of features which contain the map
    feature


This specification may be in geodetic coordinates or require image-to-map
correspondence.
• 'intersection'
  - an unordered list of features which intersect the map
    feature

• 'closest point'
  - a single feature which is closest to the map feature

• 'adjacency'
  - an unordered list of features that are within a specific
    distance of the map feature

• 'partitioned by'
  - the locus of points where two areal features share a
    common boundary

If one or more of the map features in a spatial computation is a result of
a dynamic query (and therefore not in the static database), these
relationships are computed as needed. A simple 'memo' function is
implemented to avoid recomputation of dynamic properties. The use
of the static description can also be 'turned off' to evaluate hierarchical
search as described in the following section.
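The interplay of the static description, on-demand computation, and the 'memo' function can be sketched as a small lookup class. This is an illustrative sketch only: the class name, the `use_static` switch, and the callback signature are assumptions and do not describe the actual MAPS code.

```python
class SpatialRelations:
    """Look up a spatial relationship, computing and caching it if needed."""

    def __init__(self, static_table, compute):
        self.static = static_table   # {(relation, feature): result}
        self.compute = compute       # fallback: compute(relation, feature)
        self.memo = {}               # results of dynamic computations
        self.use_static = True       # can be 'turned off' for evaluation
        self.computed = 0            # number of dynamic computations performed

    def lookup(self, relation, feature):
        key = (relation, feature)
        if self.use_static and key in self.static:
            return self.static[key]
        if key not in self.memo:
            self.memo[key] = self.compute(relation, feature)
            self.computed += 1
        return self.memo[key]


def slow_compute(relation, feature):
    # Stand-in for a geometric computation over the geodetic descriptions.
    return ['greater washington d.c.'] if relation == 'subsumed by' else []


static = {('subsumed by', 'foggy bottom'): ['northwest washington']}
relations = SpatialRelations(static, slow_compute)

first = relations.lookup('subsumed by', 'querypoint 1')
second = relations.lookup('subsumed by', 'querypoint 1')
stored = relations.lookup('subsumed by', 'foggy bottom')
```

The second lookup for 'querypoint 1' is answered from the memo table, so the geometric routine runs once per dynamic feature and relation.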


The CONCEPTMAP database stores both factual and exact information
describing the spatial relationship. For example, if two features
intersect, the list of geodetic intersection points is stored, as well as the
fact that they intersect at least once. This is necessary for queries which
require the display of imagery containing a geometric fact and may
possibly be useful for describing the semantics of the intersection. In
the following section we will discuss the use of a hierarchical
organization based on the 'contains' relation primitive, and show how it
can be used to structure the spatial database.

6.6.  Hierarchical  Organization 
In this section we discuss the use of hierarchical organization of
spatial data in the MAPS system. The CONCEPTMAP database is used to
build a hierarchy tree data structure which represents the whole-part
relationships and spatial containment of map feature descriptions. This
tree is used to improve the speed of spatial computations by
constraining search to a portion of the database. In the following
sections we briefly discuss why we believe this is a good alternative to
regular spatial decompositions such as quadtree[15,16] or k-d tree[17]
usually proposed for MPD model databases.
6.6.1. Regular Decomposition

Regular decompositions such as in quadtree organizations do not
explicitly exploit the inherent structure in spatial organizations.
Practical implementations of these organizations often use image-based
(integer) coordinate systems and therefore have a bounded position
resolution. In general cartographic systems it is important to be able to
represent and manipulate map feature descriptions at radically different
resolutions using a real valued coordinate system. For example,
consider a dynamic query that results in the creation of a very small
polygonal area. When computing containment or intersection against a
static map database with features represented as quadtrees, the
quadtrees for the static map feature must be generated to a much finer
level of detail in order to compare the two data structures. Recent work
is beginning to represent quadtrees on real valued coordinate systems[1?],
but little is known of its practical implementation, complexity, and
storage efficiency. K-d trees show storage efficiency improvements
over quadtrees[17], since they allow for a more flexible decomposition
tailored to spatial feature density. However, they have the same
fundamental limitations when used to represent map features in a real
valued coordinate system.

In MAPS we perform geometric computations on the feature data in
the geodetic coordinate system using point, line, and polygon as map
primitives. We constrain search by using a hierarchical representation
computed directly from the underlying map data. These spatial
constraints can be viewed as natural, that is, intrinsic to the data, and
may have some analogy to how humans organize a "map in the head"
to avoid search. For example, when a tourist who is looking for the
Watergate Hotel is told that the building is in Northwest Washington,
she will not spend much time looking at a map of Virginia. Depending
on her familiarity with the area, she may avoid looking at much of the
map outside of the Northwest District. As we begin to represent
large numbers of map features with more complex interrelationships,
we believe that the use of natural hierarchies in urban areas, such as
political boundaries, neighborhoods, commercial and industrial areas,
serves to constrain search. They may also allow us to build systems that
organize data using spatial relationships that are close to human spatial
models.

6.6.2. Hierarchical Decomposition

The hierarchical containment tree is a tree structure where nodes
represent map features. Each node has as its descendants those features
that it completely contains in <latitude/longitude/elevation> space. The
hierarchical tree is initially generated by obtaining an unordered list of
features (containment list) for each map database feature. Starting with
a designated root node ('greater washington d.c.') which contains all
features in the database, descendant nodes are recursively removed
from the parent node list if they are already contained in another
descendant node. The result is that the parent node is left with a list of
descendant features that are not contained by any other node. These
descendant nodes form the next level of an N-ary tree ordered by the
'contains' relationship. This procedure is performed recursively for
every map feature. Terminal nodes are point and line features, or areal
features that contain no other map feature. We will discuss the point
containment and closest point computation using the hierarchy tree in
the following section.
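The pruning step can be illustrated with a short sketch that turns full containment lists into direct parent-child links. It is written under stated assumptions: the dictionary representation and the function name are hypothetical, the containment lists below are a toy example loosely following Figure 10, and features with identical extents (each containing the other) are not handled.

```python
def build_tree(contains, root):
    """contains maps each feature to the set of ALL features it contains.
    Returns a map from each reachable feature to its direct descendants."""
    tree = {}

    def expand(node):
        if node in tree:
            return
        inside = contains.get(node, set())
        # Keep only features not already contained in another listed feature.
        direct = {f for f in inside
                  if not any(f in contains.get(g, set())
                             for g in inside if g != f)}
        tree[node] = sorted(direct)
        for child in tree[node]:
            expand(child)

    expand(root)
    return tree


contains = {
    'greater washington d.c.': {'district of columbia', 'northwest washington',
                                'potomac river', 'theodore roosevelt island',
                                'virginia'},
    'district of columbia': {'northwest washington',
                             'theodore roosevelt island'},
    'northwest washington': {'theodore roosevelt island'},
    'potomac river': {'theodore roosevelt island'},
}
tree = build_tree(contains, 'greater washington d.c.')
```

Because a feature may survive pruning under more than one parent, the result is strictly a directed acyclic graph; 'theodore roosevelt island' appears under both 'northwest washington' and 'potomac river', which is what gives rise to multiple search paths.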

Figure 9 shows a small section of the hierarchical containment tree.
Conceptual features, that is, features with no physical realization in
the world but which represent well understood spatial areas, can be
used to partition the database. In this case the map feature 'foggy bottom'
If she is told that the Watergate is also near the Potomac river, that should
further constrain her search, but that is another story.


Figure 9: MAPS: Hierarchical Spatial Containment


allows us to partition some of the buildings and roads that are contained
within 'northwest washington'. As more neighborhood areas and city
districts are added to our database we expect to see improved
performance especially in areas with dense feature distributions. This
will also improve the richness of the spatial description available to the
user.

6.6.3. Hierarchical Search

In this section we discuss the use of our hierarchical organization to
partition the map database to improve performance by decreasing
search when computing the spatial relationships of map features. The
hierarchical searching algorithm is basically an N-ary tree searching
algorithm. Consider a user at the CONCEPTMAP image display who
invokes the geometric database to compute a symbolic description of
what map feature he is pointing at. First, using image-to-map
correspondence, the system calculates the following map coordinates:

latitude N 38 53 49 (276)
longitude W 77 03 53 (337)

This point is converted into a temporary map database feature and is
tested against the root node of the hierarchy tree. If it is not contained
in this node (not generally the case), then the point cannot correspond
to a database feature, and the search terminates. The user is informed
that the point is outside the map database. If the 'contains' test
succeeds, it recurses down the tree and performs the test against the
siblings of the node just tested. The search allows several paths to exist
for any point, thus more than one sibling may contain a path to the
point. This sort of anomaly occurs when a feature happens to exist in
the intersecting region of two larger regions. However, if the feature is
not contained by the node, it is not contained by any of the node's
descendants, and that portion of the tree is not further searched. Figure
10 shows the answer to our hypothetical query. The query point is
contained within 'theodore roosevelt island', and two search paths in
the containment tree are given. The same mechanism is used for line
and polygon features, although the primitive determination of
containment depends on the geometric type of the feature.
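The pruned search described above can be sketched as a recursive walk that abandons a subtree as soon as the 'contains' test fails. The sketch is illustrative and rests on assumptions: containment is tested against axis-aligned boxes in arbitrary units rather than real polygons in geodetic coordinates, the box values are invented, and the function names are hypothetical.

```python
def contains_point(box, point):
    """Stand-in 'contains' test on a (min_x, min_y, max_x, max_y) box."""
    min_x, min_y, max_x, max_y = box
    x, y = point
    return min_x <= x <= max_x and min_y <= y <= max_y


def search(tree, boxes, node, point, path=()):
    """Return every containment path for point, innermost feature first."""
    if not contains_point(boxes[node], point):
        return []                    # prune: no descendant can contain it
    path = (node,) + path
    found = []
    for child in tree.get(node, []):
        found.extend(search(tree, boxes, child, point, path))
    return found or [list(path)]


ROOT = 'greater washington d.c.'
tree = {
    ROOT: ['district of columbia', 'potomac river', 'virginia'],
    'district of columbia': ['northwest washington'],
    'northwest washington': ['theodore roosevelt island'],
    'potomac river': ['theodore roosevelt island'],
}
boxes = {  # invented extents in arbitrary units
    ROOT: (0, 0, 10, 10),
    'district of columbia': (0, 0, 6, 10),
    'northwest washington': (0, 0, 6, 5),
    'potomac river': (4, 0, 8, 10),
    'virginia': (8, 0, 10, 10),
    'theodore roosevelt island': (4.5, 1, 5.5, 2),
}
paths = search(tree, boxes, ROOT, (5, 1.5))
```

With these toy extents the query point yields two paths, one through 'northwest washington' and one through 'potomac river', mirroring the two entries reported for the island in Figure 10. An empty result corresponds to the case where the point lies outside the map database.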


This node belongs in the following place(s)

3 entries for 'contains' for 'theodore roosevelt island'
entry 0: 'northwest washington'
entry 1: 'district of columbia'
entry 2: 'greater washington d.c.'

... AND ...

2 entries for 'contains' for 'theodore roosevelt island'
entry 0: 'potomac river'
entry 1: 'greater washington d.c.'


Figure 10: MAPS: Containment Tree Entry for
Theodore Roosevelt Island


This can actually occur since users are allowed to enter arbitrary coordinates
through the terminal. Therefore the database has some crude idea of its extent of map
knowledge.


6.7. Toward Feature Semantics
We have begun to investigate the generation of map feature
semantics directly from the hierarchical representation of the map
feature data. A simple example is the semantic description of a bridge:
the feature names and map locations that it connects as well as the
names of the map features that it crosses over. Figures 11 and 12 show
the result of applying a procedural description of the semantics of a
bridge concept to calculate the 'connects' and 'crossover' relationship
using the map feature descriptions of 'arlington memorial bridge' and
'theodore roosevelt memorial bridge'. These results are generated
directly using the MAPS hierarchical organization for spatial data. We
do not pose this as a theory of map feature semantics, but envision a set
of feature specific procedures that can build these types of descriptions.


2 entries for 'contains' for 'querypoint 1'
entry 0: 'virginia'
entry 1: 'greater washington d.c.'

2 entries for 'contains' for 'querypoint 1'
entry 0: 'arlington memorial bridge'
entry 1: 'greater washington d.c.'

4 entries for 'contains' for 'querypoint 2'
entry 0: 'mall area'
entry 1: 'southwest washington'
entry 2: 'district of columbia'
entry 3: 'greater washington d.c.'

2 entries for 'contains' for 'querypoint 2'
entry 0: 'arlington memorial bridge'
entry 1: 'greater washington d.c.'

5 entries for 'intersection' for 'crossover'
entry 0: 'virginia'
entry 1: 'district of columbia'
entry 2: 'southwest washington'
entry 3: 'mall area'
entry 4: 'potomac river (Role 0)'

2 entries for 'connects' for 'arlington memorial bridge'
entry 0: 'virginia'
entry 1: 'mall area'

1 entries for 'crossover' for 'arlington memorial bridge'
entry 0: 'potomac river'

Figure 11: MAPS: Semantic Computation from Spatial Data
Arlington Memorial Bridge

The procedure for bridge semantics is as follows: A bridge can be
represented in the CONCEPTMAP database as a polygonal area, a list of
linear segments, or as a geodetic point. The polygonal area arises when
the bridge deck is represented, the list of linear segments approximates
the center line of the bridge, and the point feature generally represents
that the bridge is a landmark feature. No semantics are computed in
the latter case. If the bridge is represented as a line, the end points are
selected, otherwise the endpoints of the major axis of the bounding
ellipse are retrieved from the feature shape description file. At some
level of description, these endpoints define the 'connects' relationship but this
... AND ...

2 entries for 'contains' for 'querypoint 2'
entry 0: 'theodore roosevelt memorial bridge'
entry 1: 'greater washington d.c.'

2 entries for 'connects' for 'theodore roosevelt memoria
entry 0: 'virginia'
entry 1: 'northwest washington'

2 entries for 'crossover' for 'theodore roosevelt memori
entry 0: 'theodore roosevelt island'
entry 1: 'potomac river'


Figure 12: MAPS: Semantic Computation from Spatial Data
Theodore Roosevelt Memorial Bridge
Figure 13: WASH3D: 3D Map Display


Figure 14: WASH3D: Vertical View 85n Northwest Wash'


is  not  useful  if  we  are  envisioning  generation  of  a  reasonably  complex 
symbolic  representation. 

The 'contains' relationship is applied to each endpoint using the
hierarchical tree to order the search. As before, this search returns a list
of features ordered by spatial containment, and there may be several
independent containment paths. Redundant paths are eliminated by
examining whether the bridge is in the containment path. The first
entry (0) in each of the remaining paths is one of the areas connected by
the bridge. Using the 'contains' relationship, the other entries in the
path are also valid connecting areas.


7.  Synthesis  Tasks 

In this section we will discuss three applications of the MAPS database
to cartographic and image interpretation tasks. These tasks are 3D scene
generation of views of Washington D.C., the use of the map database
to guide image segmentation, and some preliminary results on a rule
based system for airport scene interpretation. Each task requires the
capabilities of various aspects of the MPD model as implemented in the
MAPS system. These applications pull together external and image/map
databases, and are only possible using an integrated system that relates
imagery, terrain, and map data through a unified cartographic
representation.


To compute the 'crossover' relationship, the 'intersection'
relationship is computed for the bridge using the complete list of line
segments or the polygonal description. A list of all the features that the
bridge intersects is assembled. Entries in the intersection list are
removed if they are also present in either of the 'connects' lists. The
assumption is that those features that didn't contain a bridge endpoint,
but intersected with the bridge description, are those features that the
bridge crosses over. If there is sufficiently detailed elevation data for
man-made features it should be possible to compute semantics for
'passes over' and 'passes under' by calculating the feature elevation at
the actual geodetic point of intersection.
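The 'connects' and 'crossover' procedure for a bridge can be written compactly once the containment paths and the intersection list are available. The sketch below is a hedged illustration rather than the MAPS procedure itself: the function name and list representation are assumptions, and the input data are transcribed from the entries shown in Figure 11.

```python
def bridge_semantics(bridge, endpoint_paths, intersections):
    """endpoint_paths: for each bridge endpoint, its containment paths
    (innermost feature first). intersections: features that intersect
    the bridge description. Returns (connects, crossover)."""
    connects, connecting_areas = [], set()
    for paths in endpoint_paths:
        for path in paths:
            if bridge in path:
                continue                   # redundant path through the bridge
            connects.append(path[0])       # innermost area at this endpoint
            connecting_areas.update(path)  # enclosing areas also connect
    crossover = [f for f in intersections if f not in connecting_areas]
    return connects, crossover


ROOT = 'greater washington d.c.'
BRIDGE = 'arlington memorial bridge'
endpoint_paths = [
    [['virginia', ROOT], [BRIDGE, ROOT]],
    [['mall area', 'southwest washington', 'district of columbia', ROOT],
     [BRIDGE, ROOT]],
]
intersections = ['virginia', 'district of columbia', 'southwest washington',
                 'mall area', 'potomac river']

connects, crossover = bridge_semantics(BRIDGE, endpoint_paths, intersections)
```

For this input the sketch reproduces the result listed in Figure 11: the bridge connects 'virginia' and 'mall area' and crosses over 'potomac river'.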


7.1. 3D Scene Generation

The first application of the MAPS database is in the area of 3D
computer graphics for scene simulation and database validation.
Computer graphics play an important role in the areas of image
processing, photo-interpretation, and cartography. In cartography
various phases of the map generation process use graphics techniques
for source material analysis, transcription and update, and some aspects
of map layout and production. However, many major steps in the
generation of a cartographic product remain largely manual. One
important step for which inadequate tools exist is the integration of
terrain and cultural feature databases. This integration step is often
used to verify the geodetic accuracy of natural and man-made features
in the digital database prior to actual map layout and production.
Another application is sensor simulation. Radar, visual, and multi-
sensor scenes are digitally generated to verify the quality of digital
culture and terrain databases or to determine the quality of the sensor
model. Improvements to the level of detail contained in the underlying
database can be subjectively measured in terms of the quality of the
generated scene.


WASH3D is an interactive graphics system that uses the MAPS system
to integrate a digital terrain database, a cultural feature database, and
the CONCEPTMAP database to allow a user to generate cartographically
accurate 3D scenes for human visual analysis. WASH3D uses the coarse
resolution DLMS database described in Section 6.4 to generate a
baseline thematic map. The thematic map is a 2D image which is
produced by scan conversion of the DLMS digital feature analysis
database (DFAD) polygon database. We assign a color to each region
polygon using the DFAD surface material code: forest and park (green),
water (blue), residential (yellow), and high density urban (brown).
DLMS terrain elevation data (DTED) is interpolated to determine ground
elevations at each point in the 2D image. Since the resolution of the
DFAD data is coarse, comparable to map scales of 1:250,000 to
1:100,000, we use the CONCEPTMAP database to provide high resolution
3D feature descriptions of buildings, roads, bridges, residential and
commercial areas. The CONCEPTMAP database is derived from imagery
with resolutions between 1:12000 and 1:36000, and the addition of
these features effectively intensifies the perceived level of detail in the
simulated scene, even though the base map is at a coarse resolution.

I tikes” describes the utility of selective database intensification for
tailoring standard database products to custom applications and for
time-critical applications which cannot be handled by normal
production schedules. Figure 13 shows the interactive process by which
users can specify an area of interest for 3D scene generation. Figures
14 and 15 show two 3D scenes of the Washington D.C. area generated
by WASH3D.
7.2. MACHINESEG: Map-Guided Machine Segmentation
The second application of the MAPS database is in the area of map-
guided machine segmentation. Users may specify a map feature from
the CONCEPTMAP database or interactively generate a feature
description using the si r:\ti M program. In the case of a map database
feature MACHINESEG uses an existing image coverage [IC] file (see
Section 6.5.2) that specifies in which images the feature is found, and
the feature location in the image. For interactive specification, an [IC]
file is created dynamically by image-to-map correspondence using the
image database.


For each image, a high resolution window containing the database
feature is extracted and displayed. We expand the size of the image
window to contain an area of uncertainty around the feature location.
The expansion is currently based on the size of the feature, but we plan
to incorporate correspondence error measures based on the quality of
the camera model associated with each image. The image window is
smoothed, and a segmentation is performed using a region-growing
technique which combines an edge strength metric and region merge
acceptability based on spectral similarity to control region growing.


Figure 16 shows the segmentation of several low-elevation buildings
along the perimeter of the Washington ellipse. The uppermost
building is added to the CONCEPTMAP database in the standard manner
described in Section 6.5. The user specifies the image on which to
perform the segmentation, and the MACHINESEG system automatically
displays a reduced resolution window of that image and a
high resolution window (ellipse area) containing the database area.
MACHINESEG creates a copy of the high resolution window as a work
area (set aside) for the image processing routines. An image smoothing
operation is followed by the generation of seed regions using a
conservative similarity measure to ensure that potentially matchable
regions are not prematurely merged. The initial seed regions are
overlaid on the image using graphics overlays. Any seed regions that
satisfy the shape criteria for the database feature are extracted and
marked. In this example, the database feature itself was marked in the
initial seed region matching. As regions are merged based on weak
edge boundaries and high spectral compatibility, the resulting region is
evaluated with respect to a list of shape and spectral criteria. If the
region satisfies the criteria, it is marked, and further merging is allowed
only if the proposed merge improves the overall region score. Criteria
include fractional fill, area, linearity, perimeter, compactness, and
spectral measures.
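
The merge rule above (merge only when the overall region score improves) can be sketched as follows. The definitions of fractional fill (area over bounding-box area) and compactness (4*pi*area/perimeter^2), and the scoring function itself, are assumptions made for illustration; the text does not define them.

```python
import math


def shape_measures(pixels):
    """Return area, perimeter, fractional fill and compactness of a pixel set."""
    area = len(pixels)
    # Perimeter: count pixel sides that face a pixel outside the region.
    perimeter = sum(
        (r + dr, c + dc) not in pixels
        for r, c in pixels
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    box = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    return {
        "area": area,
        "perimeter": perimeter,
        "fill": area / box,
        "compactness": 4 * math.pi * area / perimeter ** 2,
    }


def score(pixels, target):
    """Hypothetical score, lower is better: summed relative deviation from target."""
    m = shape_measures(pixels)
    return sum(abs(m[k] - target[k]) / target[k] for k in target)


def try_merge(region, neighbor, target):
    """Accept the merge only if it improves the region score."""
    merged = region | neighbor
    return merged if score(merged, target) < score(region, target) else region


# Made-up data: a 4x6 "building" serves as the map feature description.
building = {(r, c) for r in range(4) for c in range(6)}
target = shape_measures(building)
left_half = {(r, c) for r in range(4) for c in range(3)}
right_half = building - left_half
strip = {(0, c) for c in range(6, 16)}   # a thin strip attached to one corner
```

Merging the two halves reproduces the target shape and is accepted, while attaching the strip to the complete building worsens every measure and is rejected.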

The  final  results  are  shown  in  the  second  window  labeled  set  aside. 

Five buildings similar to the map database feature were correctly
identified while one building was omitted. Six segments were
incorrectly identified. Had we made use of spectral information in this
particular segmentation (that the building roofs were bright
features) we probably could have excluded 5 of the 6 errors. However,
we are more concerned with using weak knowledge, and one cannot
expect better performance without more sophisticated analysis.
MACHINESEG allows the user to delete erroneous segments and
generates map descriptions of each extracted feature. These
descriptions can then be used to search for these features in other
database imagery.

The significance of MACHINESEG is that it can search systematically
for features in a database of images, an operation that is fundamental
for change detection applications. It directly uses the map database
description as an evaluation tool for image segmentation and
interpretation. It also uses very general image processing tools to
perform both segmentation and evaluation and is amenable to
supporting other approaches to image segmentation and feature
recovery. A further application of the MACHINESEG system is discussed
in the following section.

7.0.3. SPAM: Rule-based System for Airport Interpretation

The third application of the MAPS system is in the investigation of
rule-based systems for the control of image processing and
interpretation with respect to a world model.


In photo-interpretation, knowledge can range from stereotypical
information about man-made and natural features found in various
situations (airports, manufacturing, industrial installations, power plants,
etc.) to particular instantiations of these situations in frequently
monitored sites. It is crucial for photo-interpretation applications that
the metrics used be defined in a cartographic coordinate system, such as
<latitude/longitude/elevation>, rather than an image-based coordinate
system. Descriptions such as "the runway has area 12000 pixels" or
"houses are between 211 and 345 pixels" are useless except for
(perhaps) the analysis of one image. It is the case, however, that to
operationalize metric knowledge one must relate the world model to the
image under analysis. This should be done through image-to-map
correspondence using camera models, which is the method used in our
system.
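
A small numeric sketch shows why a pixel count alone carries no metric meaning. The scan pitch (50 micrometres per pixel) and the 1:24000 scale are hypothetical values chosen only for illustration; 1:12000 is one of the image scales mentioned earlier.

```python
def ground_area_m2(pixel_count, scale_denominator, scan_pitch_m):
    """Convert an area in pixels to square metres on the ground."""
    # Ground distance covered by one pixel side.
    ground_sample_m = scale_denominator * scan_pitch_m
    return pixel_count * ground_sample_m ** 2


# The same "12000 pixel" region measured on two images of different scale.
area_at_12000 = ground_area_m2(12000, 12000, 50e-6)
area_at_24000 = ground_area_m2(12000, 24000, 50e-6)
```

Under these assumptions the same pixel count corresponds to ground areas that differ by a factor of four, which is why the rule base needs metrics expressed in map coordinates.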

We have begun to build SPAM to test our ideas in the use of the
combination of a map database, task independent low-level image
processing tools, and a rule-based system.

SPAM uses the MAPS database to store facts about man-made or
natural feature existence and location, and to perform geometric
computation in map space rather than image space. Differences in scale,
orientation, and viewpoint can be handled in a consistent manner using
a simple camera model. The MAPS database facility also maintains a
partial model of interpretation, separate from, but in the same
representation as, the map feature database.

The image processing component is based on the MACHINESEG
program described in the previous section. It performs low level and
intermediate level feature extraction. Processing primitives are based
on linear feature extraction and region extraction using edge-based and
region growing techniques. It identifies islands of interest and extends
those islands constrained by the geometric model provided by MAPS
and model-based goals established by the rule-based component.

The rule-based component provides the image processing system
with the best next task based on the strength/promise of expectations
and with constraints from the image/map database system. It also
guides the scene interpretation by generating successively more specific
expectations based on image processing results.

We are in the preliminary stages of development for the SPAM system
and have begun to build a detailed map model of National Airport.

Figure 17 gives an example of the ability of the MAPS database to use
image-to-map correspondence to generate unified spatial models from
partial information. One line drawing contains the northern section of
National Airport; the other is a partially overlapping southern section
of National Airport. Line segments represent point, line, and areal
features corresponding to runways, terminal buildings, access roads,
and hangars, interactively specified using the CONCEPTMAP
representation. For those features that appear in both images, the
concept role mechanism (see Section 6.5.2) is used to specify multiple
<latitude/longitude/elevation> descriptions. A unified map description
is created by matching corresponding line segments using the
overlapping image areas (in map space) to constrain search. The result
of unification is the line drawing labeled AIRPORT.IMG.

Figure 17: SPAM: National Airport Spatial Model
(panels: 37401.img; unified.img, unified scene of Washington National Airport)

8.  Future  Work 

Our future work will be directed toward two research topics. First,
we have only begun to explore the use of MAPS as a component of an
image interpretation system. We will continue our work in the airport
scene interpretation task using the SPAM system as a testbed for
integration of a rule-based system with the MAPS system. Second, there
is much to do in expanding the CONCEPTMAP database to include more
complex 3D descriptions, and in attendant issues of scaling and sizing
to larger databases. Other tasks we will pursue are the evaluation of our
hierarchical spatial representation to constrain search in large databases,
general solutions to complex spatial queries for situation assessment
applications, and the application of spatial knowledge to navigate
through a map database.

Figure 17 (panels): Northern section of Washington National Airport;
Southern section of Washington National Airport.

In discussing future work it is important to understand the strengths
and limitations of the current research. The strengths of this work lie in
several unique features of the MAPS system. First, we have constructed
a system of moderate complexity which has significant capabilities in
each area of our Image/Map Database model. The system integrates
map knowledge from diverse sources and performs several tasks that
require synthesis of this knowledge. We have the ability to represent
complex map features in a uniform cartographic coordinate system and
can compute new spatial relationships directly from the map data.


The major limitation in the MAPS system is the current method for
performing image-to-map correspondence. From the standpoint
of the state of the art in photogrammetry, we make simplistic
planimetric assumptions in our correspondence algorithm, but they do
give reasonable results for several reasons. First, all of our photographs
are vertical aerial mapping imagery, and efforts are taken to minimize
camera tilt. Second, we have very high resolution photographs, each of
which covers a relatively small area, and due to the relatively level local
terrain in Washington D.C., our polynomial correspondence functions
are reasonably accurate.
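
A polynomial correspondence function of the kind mentioned above can be fitted to landmark pairs by least squares. The sketch below assumes a first-order polynomial (value = c0 + c1*row + c2*col) and uses made-up landmark coordinates; the order of the polynomial actually used by MAPS and all function names here are assumptions.

```python
def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_first_order(pixels, values):
    """Least-squares fit of value = c0 + c1*row + c2*col via the normal equations."""
    basis = [[1.0, r, c] for r, c in pixels]
    ata = [[sum(b[i] * b[j] for b in basis) for j in range(3)] for i in range(3)]
    atb = [sum(b[i] * v for b, v in zip(basis, values)) for i in range(3)]
    return solve(ata, atb)


def predict(coef, row, col):
    """Evaluate the fitted polynomial at an image position."""
    return coef[0] + coef[1] * row + coef[2] * col


# Made-up landmark pairs: image (row, col) against latitude and longitude.
pixels = [(0, 0), (0, 1000), (1000, 0), (1000, 1000), (500, 500)]
lats = [38.90 - 1e-5 * r for r, _ in pixels]
lons = [-77.05 + 1.2e-5 * c for _, c in pixels]
lat_coef = fit_first_order(pixels, lats)
lon_coef = fit_first_order(pixels, lons)
```

Such a planar fit ignores relief displacement and camera tilt, which is consistent with the planimetric assumptions and their stated limits.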

The issue is not how to recover camera information from the
imagery, since in cartography and manual photo-interpretation the
sensor models and ephemeris data are well known and modeled, but to
use existing photogrammetric tools for basic data acquisition.
Therefore, in this limitation we see an opportunity to investigate how
MAPS could be interfaced to a photogrammetric frontend which would
directly provide <latitude/longitude/elevation> data from a stereo
model. The frontend should have a landmark database and
interactive display tools to guide the stereo model setup in a manner
similar to our current implementation. Nothing in the current MAPS
implementation precludes such an interface since we maintain a 3D
map feature representation throughout the database using the USGS
terrain database. The building of such tools should be the common
objective both to cartographers and to computer scientists.

MAPS System Major Components

This Appendix contains a list of the major program modules which
compose the MAPS system.


NAME            SIZE (bytes)  COMMENTS

Browse
  browse        306500        interactive image display facility
  picpac        530702        interactive image processing facility

Corres
  corres        200042        interactive image-map correspondence
  checkcorres   49523         check correspondence errors
  cormain       52893         correspondence algorithm
  corpairs      751?0         edit correspondence pairs file
  creatsdf      50601         create a scene description file
  dumpcoef      19?49         dump a coefficients file
  dumpcor       23547         dump a correspondence file
  dumpsdf       25398         dump a scene description file
  hypcorpairs   82380         generate hypothesized landmarks
  updatesdf     69099         update a scene description file

Landmark
  landmark      194963        interactive landmark extraction
  creatldm      23557         create binary landmark file
  etytod3       50217         make a d3 file from an ety file
  etytoldm      19948         create landmark file from ety files
  ldescribe     43695         give landmark descriptions
  ldmprint      30275         dump all info about a landmark
  ldmtest       28696         find landmarks within geodetic area

Segment
  segment       170230        hand segmentation program
  mkidf         10537         create ascii file from binary seg file
  segrename     39045         edit segmentation region names

Machineseg
  machineseg    ?902?2        machine segmentation program

Conceptmap
  conceptmap    665710        associate conceptual and map data
  buildsegmap   98301         build composite segmentations
  coetrack      125241        track points using map correspondence
  congeoall     213278        generate geometric database
  d3dump        24629         dump a d3 file
  d3entcor      93936         create corres entry from .d3 file
  d3fdump       31039         dump a d3 feature file
  d3tod3f       15826         convert a d3 file to a feature file
  d3toimg       44710         generate binary image from .d3 files
  dlmsseg       128324        create DLMS overlay for geodetic area
  dmaextract    31544         extract features from DLMS fea files
  dumpql        207962        dump a query list file
  dumpsdf       25398         dump a scene description file
  ecdump        9425          dump the contents of a coverage file
  ecshow        137700        display manager for coverage files
  ecsort        26624         sort coverage files by keys
  ectoseg       18173         create seg file from coverage file
  hierarchy     486262        build and access hierarchical database
  hiertrack     321869        track and display pts using hierarchy
  idhier        254739        identify points using hierarchy
  imagetoec     34283         associate image with coverage file
  imagetomap    54092         <generic><row><col> -> <lat/lon/elev>
  photo         299710        interactive image photogrammetry
  segtod3       57034         convert .seg file to d3 data structure
  segtoimg      32785         convert .seg regions to binary image
  stereoshow    153125        show stereo image pairs
  unifyseg      107603        unify segmentation regions

Wash3d
  wash3d        764517        3d scene generation from MAPS database

Abstract


Images are two dimensional projections of
three dimensional scenes; therefore depth recovery
is a crucial problem in Image Understanding, with
applications in passive navigation, cartography,
surveillance, and industrial robotics. Stereo
analysis provides a more direct quantitative depth
evaluation than techniques such as shape from shading,
and its being passive makes it more applicable
than active range finding imagery by laser or
radar. This paper addresses the subproblem of
identifying corresponding points in the two images.
The primitives we are using are groups of collinear
connected edge points called segments, and we base
the correspondence on the minimum "differential
disparity" criterion. The result of this processing
is a sparse array disparity map of the analyzed
scene.


I.  Introduction 


The human visual system perceives depth with
no apparent effort and very few mistakes, but how
it does so is not understood. Binocular stereopsis
plays a key role in this process, and the
straightforward extraction of depth it provides,
once corresponding points are identified, makes it
very attractive. Depth recovery is necessary in
domains such as passive navigation [Gennery80,
Moravec80], cartography [Kelly77, Panton78],
surveillance [Henderson79] and industrial robotics.
Proposed solutions for the stereo problem follow a
paradigm involving the following steps [Barnard82]:

- image acquisition,
- camera modeling,
- feature acquisition,
- image matching,
- depth determination,
- interpolation.

The hardest step is image matching, that is, identifying
corresponding points in two images, and
this paper is solely devoted to it. The next section
reviews the existing systems that have been
proposed so far, divided in two broad classes,
area-based and edge-based; then we summarize our
assumptions and give a formal description of the
method. The fourth section presents results, and we
then discuss extensions.


II. Review of existing methods

Two classes of techniques have been used for
stereo matching, area-based and feature-based.


2.1. Area-based stereo

Ideally, one would like to find a corresponding
pixel for each pixel in each image of a stereo
pair, but the semantic information conveyed by a
single pixel is too low to resolve ambiguous
matches. We therefore have to consider an area or
neighborhood around each pixel, and use
correlation-based matching algorithms to determine
the corresponding match, thereby using
local context to resolve ambiguities. The justification
for such an approach is that of
"continuity", that is, disparity values change
smoothly, except at a few depth discontinuities.
All systems based on area-correlation suffer from
the same limitations:

- They require the presence of a detectable
  texture within each correlation window,
  therefore they tend to fail in featureless
  or repetitive texture environments.

- They tend to be confused by the presence
  of a surface discontinuity in a correlation
  window.

- They are sensitive to absolute intensity,
  contrast and illumination.

- They get confused in rapidly changing
  depth fields (vegetation).

For these reasons, the existing systems, especially
the ones used in "automatic" cartography, require
the intervention of human operators to guide them
and correct them. Such systems are described in
[Lucas81, Panton78, Hannah80, Barnard80,
Moravec79].
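
As a rough illustration of area-based matching, the sketch below slides a window along one epipolar line (one image row) and keeps the shift with the highest correlation. It is not taken from any of the cited systems; the normalized form of the correlation, the function names, and the sample intensities are assumptions.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equal-length intensity windows."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0   # textureless window: no usable score


def best_disparity(left_row, right_row, x, half, maxd):
    """Search one row for the shift that best matches the window centered at x."""
    window = left_row[x - half : x + half + 1]
    best, best_score = None, -1.0
    for d in range(-maxd, maxd + 1):
        lo, hi = x + d - half, x + d + half + 1
        if lo < 0 or hi > len(right_row):
            continue   # window would fall outside the image
        s = ncc(window, right_row[lo:hi])
        if s > best_score:
            best, best_score = d, s
    return best, best_score


# Made-up rows: the right row is the left row shifted by 2 pixels,
# with a different gain and offset.
left_row = [10, 10, 10, 50, 90, 40, 10, 10, 10, 10, 10, 10]
right_row = [2 * v + 5 for v in [10, 10] + left_row[:-2]]
shift, corr = best_disparity(left_row, right_row, 4, 1, 3)
```

The zero score returned for flat windows mirrors the first limitation listed above: without texture inside the window there is nothing to correlate.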


2.2.  Feature-based  systems 


The depth information in stereo analysis is
conveyed by the differences in the two images of a
stereo pair due to the different viewpoints, the
differences being most prominent at the discontinuities,
or edges. Obviously, matching of features
will not provide a full depth map, and must
be followed by an interpolating scheme. The common
characteristics of feature-based matching techniques
are:

- They are faster than area-based methods,
  because there are many fewer points to
  consider.

- The obtained match is more accurate;
  edges can even be located with sub-pixel
  precision [Binford81].

- They are less sensitive to photometric
  variations, since they represent
  geometric properties of a scene.

Henderson [Henderson79] considered scenes representing
cultural sites (man-made structures) and
matched edge points on epipolar lines in the two
views. He reduced ambiguity by assuming continuity
between consecutive epipolar lines. Marr and Poggio
have relied on two apparently simple
constraints [Marr79]:

1. Uniqueness.
   Each point in an image may be assigned
   at most one disparity value. One may
   note that this assumption is not correct
   for transparent objects.

2. Continuity.
   Matter is cohesive, therefore disparity values
   change smoothly, except at a few depth
   discontinuities.

They first proposed a cooperative algorithm [Marr76]
that works very well on random-dot stereograms, but
they rejected it to propose one of more heuristic
nature, implemented by Grimson [Grimson79,
Grimson81], that generates good results, given the
very few assumptions. Arnold [Arnold78] matches
edges using local context, and his system seems to
perform well on cultural scenes. Finally, Baker
and Binford [Baker82] match edges on epipolar lines
by using the no-reversal constraint that the order
of the match has to be preserved, in addition to
uniqueness and continuity. They also consider continuity
by examining adjacent epipolar lines. This
system appears to perform reasonably on a wide
variety of images.

In most of the systems presented above, a considerable
saving in search time is obtained by a
coarse to fine matching, that is, the matching is
originally done on a low-resolution version of the
image and the results are propagated to the higher
resolution version. However, it should be noted
that in current implementations, good matches as
well as errors tend to propagate from one level to
the next.


III.  The  Minimal  Differential 
Disparity  Algorithm 

From the survey conducted above, it appears
that feature-based techniques are more appropriate
to solve the correspondence problem, but edges as a
primitive seem to be too low-level, and a connectivity
check is needed to remove spurious matches.
High level primitives such as physical object boundaries
or surface descriptions would be preferred;
however, stereo processing may need to precede the
computation of such descriptions. As a step
towards higher level primitives, we are using
segments. In order to generate them, we fit
straight lines through adjacent edge points with a
given tolerance of one pixel. These segments can
be described by:

- coordinates of the end points

- orientation

- strength (average contrast)

By using these primitives, we implicitly assume the
connectivity constraint. When matching segments,
we need to allow one segment to possibly match with
more than one segment in the other image (i.e. to
allow for fragmented segments), even if we wish to
preserve unique matches for the individual edge
points. Also, instead of considering one epipolar
line at a time, we have to consider all epipolar
lines in which a given segment appears.


3.1.  Assumptions  and  Definitions 

We consider a simple camera geometry in which
the epipolar plane, defined as the plane passing
through an object point and the two camera foci,
intersects the two image planes, so defining
epipolar lines parallel to the y axis. Therefore,
corresponding points must lie on corresponding
epipolar lines, that is, have the same row value.
This is illustrated in Figure 3-1.


Figure 3-1: Collinear Epipolar Geometry
(from [Baker82])

We also give a bound on the disparity range allowable
for any given segment; let us call it maxd.

Let A = {a_i} be the set of segments in the left image.
Let B = {b_j} be the set of segments in the right
image.

Then, for each segment a_i (resp. b_j) in the left
(resp. right) image, we can define a window
w(i) (resp. w(j)) in which corresponding segments
from the right (resp. left) image must lie. The
shape of this window is a parallelogram, one median
being a_i (resp. b_j), the other a horizontal vector
of length 2*maxd. One can see that a_i in w(j) implies
b_j in w(i).

We define the boolean function p(i,j) relating two
segments as:

- p(i,j) is true if

    - b_j overlaps w(i)

    - a_i, b_j have "similar" contrast

    - a_i, b_j have "similar" orientation


The  required  similarity  in  orientation  is  loose  and 
is  a  function  of  the  segment  length.  We  have  set 
it  to  be  25  degrees  for  long  segments  and  up  to  90 
degrees  for  very  short  segments. 

Two segments are defined to have similar contrast
if the absolute value of the difference of the individual
contrasts is less than 20% of the larger
one.
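
The predicate p(i,j) can be sketched as below. Several details are assumptions and not taken from the text: segments are stored as (row, column) end points, orientation is treated as directed, the angular tolerance is interpolated linearly between two hypothetical lengths (5 and 30 pixels), and the window test is simplified to "within maxd columns on some shared row".

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    r0: float
    c0: float
    r1: float
    c1: float
    contrast: float

    @property
    def length(self):
        return math.hypot(self.r1 - self.r0, self.c1 - self.c0)

    @property
    def angle(self):
        """Directed orientation in degrees, in [0, 360)."""
        return math.degrees(math.atan2(self.r1 - self.r0, self.c1 - self.c0)) % 360.0


def similar_contrast(a, b):
    """Difference below 20% of the larger contrast."""
    return abs(a.contrast - b.contrast) < 0.2 * max(abs(a.contrast), abs(b.contrast))


def angle_tolerance(length, short_len=5.0, long_len=30.0):
    """25 degrees for long segments, widening to 90 for very short ones."""
    t = min(max((length - short_len) / (long_len - short_len), 0.0), 1.0)
    return 90.0 - t * (90.0 - 25.0)


def similar_orientation(a, b):
    diff = abs(a.angle - b.angle)
    diff = min(diff, 360.0 - diff)
    return diff <= angle_tolerance(min(a.length, b.length))


def col_at(s, row):
    """Column of a non-horizontal segment at a given row."""
    return s.c0 + (s.c1 - s.c0) * (row - s.r0) / (s.r1 - s.r0)


def in_window(a, b, maxd):
    """True if b comes within maxd columns of a on some shared row."""
    if a.r0 == a.r1 or b.r0 == b.r1:
        return False   # horizontal segments are not handled in this sketch
    lo = max(min(a.r0, a.r1), min(b.r0, b.r1))
    hi = min(max(a.r0, a.r1), max(b.r0, b.r1))
    if lo > hi:
        return False   # no common rows
    f_lo = col_at(b, lo) - col_at(a, lo)
    f_hi = col_at(b, hi) - col_at(a, hi)
    if f_lo * f_hi <= 0:
        return True    # the column offset changes sign, so it passes through zero
    return min(abs(f_lo), abs(f_hi)) <= maxd


def p(a, b, maxd):
    return in_window(a, b, maxd) and similar_contrast(a, b) and similar_orientation(a, b)
```

For instance, `p(Segment(0, 10, 40, 10, 100), Segment(0, 15, 40, 17, 90), 8)` holds, while the same right-hand segment with contrast 50, or traced in the opposite direction, is rejected.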

To each pair (i,j) such that p(i,j) is true we associate
an average disparity d_ij, which is the
average of the disparity between the two segments
a_i and b_j along the length of their overlap.
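
The average disparity of a pair can be computed as sketched here, assuming each segment is given by two (row, column) end points, that neither segment is horizontal, and that disparity is measured as right column minus left column (the sign convention is an assumption).

```python
def col_at(seg, row):
    """Column of a non-horizontal segment at a given row (linear interpolation)."""
    (r0, c0), (r1, c1) = seg
    return c0 + (c1 - c0) * (row - r0) / (r1 - r0)


def average_disparity(a, b):
    """Mean column offset b - a over the rows both segments span, or None."""
    lo = max(min(a[0][0], a[1][0]), min(b[0][0], b[1][0]))
    hi = min(max(a[0][0], a[1][0]), max(b[0][0], b[1][0]))
    if lo > hi:
        return None   # the segments share no epipolar line
    mid = (lo + hi) / 2.0
    # The offset is linear in the row, so its mean equals its midpoint value.
    return col_at(b, mid) - col_at(a, mid)


# Made-up segments: b is parallel to a and lies 3 columns to its right.
a = ((0, 10), (10, 20))
b = ((5, 18), (15, 28))
d_ab = average_disparity(a, b)
```

Evaluating the offset at the midpoint of the shared rows avoids sampling every row while giving the same mean.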

We define the two functions S1 and S2 as:

$$S_1(a_i) = \{\, j \mid b_j \text{ in } w(i) \text{ and } p(i,j) \text{ is true} \,\}$$

$$S_2(a_i) = \{\, j \mid b_j \text{ in } w(i) \text{ and } p(i,j) \text{ is false} \,\}$$

Similarly, we define S1(b_j) and S2(b_j). We will
also need the value card(a_i), which is the number
of elements in the set S1(a_i) ∪ S2(a_i).

It is to be noted that all the functions described
above are static, meaning that they are computed
only once.


3.2.  Description 

Each possible match is evaluated by computing
a measure of the distortion this match provokes for
its neighbors, i.e. given that (i,j) is a correct
match with its associated disparity d_ij, how well
do the neighbors agree with this proposed disparity?
We compute an evaluation of the match
(i,j) and compare to the matches (i,k) and (h,j)
for k in S1(a_i) and h in S1(b_j). If the evaluation
is minimum for (i,j), then j is the preferred interpretation
for i and i is the preferred interpretation
for j. For any iteration after the first
one, in order to evaluate a match (i,j), we only
look at the preferred matches for the neighbors of
i and j, if they have any. Formally, the computation
of v^t(i,j) is:

At iteration 1:

$$
v^{1}(i,j) = \frac{1}{\mathrm{card}(b_j)}
\sum_{\substack{h \in S_1(b_j) \cup S_2(b_j) \\ h \neq i}}
\; \min_{\substack{k \in S_1(a_h) \\ k \neq j}} |d_{hk} - d_{ij}|
\;+\;
\frac{1}{\mathrm{card}(a_i)}
\sum_{\substack{k \in S_1(a_i) \cup S_2(a_i) \\ k \neq j}}
\; \min_{\substack{h \in S_1(b_k) \\ h \neq i}} |d_{hk} - d_{ij}|
$$

At the end of each iteration t, we define the sets
Q(a_i) and Q(b_j) as: j in Q(a_i) and i in Q(b_j) if

$$
\forall k \in S_1(a_i),\; v^{t}(i,j) \le v^{t}(i,k)
\quad \text{and} \quad
\forall h \in S_1(b_j),\; v^{t}(i,j) \le v^{t}(h,j)
$$

For any iteration after the first one, the computation
of v^{t+1}(i,j) becomes

$$
v^{t+1}(i,j) = \frac{1}{\mathrm{card}(b_j)}
\sum_{\substack{h \in S_1(b_j) \cup S_2(b_j) \\ h \neq i}}
\; \min_{k \in Q(a_h)} |d_{hk} - d_{ij}|
\;+\;
\frac{1}{\mathrm{card}(a_i)}
\sum_{\substack{k \in S_1(a_i) \cup S_2(a_i) \\ k \neq j}}
\; \min_{h \in Q(b_k)} |d_{hk} - d_{ij}|
$$

if the sets Q are not empty; otherwise the computation
of the function v is done using the formula for iteration 1.

At the last iteration, only those elements
that have a preferred match are considered valid,
and a disparity map array is filled using these
values. It is interesting to note that this process
is absolutely symmetric in the two views and therefore
will yield identical results (except for the
sign of the disparity) if the two views are interchanged.
It is helpful to look at a simple example
to understand this process.


3.3.  Example 

Let our 2 views be the ones shown in Figure
3-2 below:


Figure 3-2: A simple example


In absence of any extra information, the correct
interpretation is that the 3 points have the same
disparity, and the result of the matching is
(a_i, b_i) for i in {1, 2, 3}.

In this example, S1(a_i) = S1(b_j) = {1, 2, 3} and
S2(a_i) = S2(b_j) = ∅. The array d_ij is

     0   1   2
    -1   0   1
    -2  -1   0


Therefore we find

$$
v^{1}(1,1) = \frac{|d_{22}-d_{11}| + |d_{33}-d_{11}|}{3}
           + \frac{|d_{22}-d_{11}| + |d_{33}-d_{11}|}{3} = 0
$$

compared to

$$
v^{1}(1,2) = \frac{|d_{23}-d_{12}| + |d_{33}-d_{12}|}{3}
           + \frac{|d_{21}-d_{12}| + |d_{23}-d_{12}|}{3} = 1
$$

and to

$$
v^{1}(1,3) = \frac{|d_{22}-d_{13}| + |d_{32}-d_{13}|}{3}
           + \frac{|d_{12}-d_{13}| + |d_{11}-d_{13}|}{3} = 2.67
$$

The calculations are similar for the other pairs,
so, at the end of the first iteration, the
preferred interpretations are only the correct
ones, and further iterations will not alter the
results.
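
The first iteration on this three-point example can be replayed with a short sketch. It assumes that every segment is a candidate for every other (S2 empty, so card = 3) and that a neighbor pair (h,k) may reuse neither i nor j; that exclusion rule is an assumption of the sketch. Indices are 0-based in the code, and off-diagonal scores (notably for the pair (1,3)) need not equal the figures quoted above, since they depend on exactly which neighbor pairs are admitted. Only the ranking of the candidates matters for the choice of preferred matches.

```python
def evaluate(d, i, j):
    """First-iteration score of match (i, j) for a full candidate matrix d."""
    n = len(d)
    left = [h for h in range(n) if h != i]    # other left-image segments
    right = [k for k in range(n) if k != j]   # other right-image segments
    term_left = sum(min(abs(d[h][k] - d[i][j]) for k in right) for h in left)
    term_right = sum(min(abs(d[h][k] - d[i][j]) for h in left) for k in right)
    return term_left / n + term_right / n     # card(.) = n because S2 is empty


# Average disparities d[i][j] of the example (0-based indices).
d = [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]]
v = [[evaluate(d, i, j) for j in range(3)] for i in range(3)]

# A pair is preferred when its score is minimal in both its row and its column.
preferred = [
    (i, j)
    for i in range(3)
    for j in range(3)
    if v[i][j] == min(v[i]) and v[i][j] == min(row[j] for row in v)
]
```

Under these assumptions the preferred pairs are exactly the diagonal ones, which agrees with the conclusion drawn from the example.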


3.4. Discussion

The criterion used here, namely the minimal
differential disparity, has similarities with the
edge interval constraints given in [Arnold80] and
subsequently used by Baker [Baker82], but is looser in
the sense that it does not require ordering of the
edges. Since our criterion does not take ordering
into account, a dynamic programming implementation
is not possible. Our evaluation function is more
informed than Baker's in the sense that it considers
all edges in a neighborhood instead of just
the predecessor and successor of a given edge. The
performance of this algorithm on a few examples is
presented next.


IV.  Results 

It  is  difficult  to  display  results  of  stereo 
matching  meaningfully,  especially  in  a  two  dimen¬ 
sional  picture,  since  we  only  generate  a  sparse 
disparity  map.  We  will  simply  show  the  line  seg¬ 
ments  in  the  two  views  that  are  found  to  match.  We 
have  nor  been  ahle  to  master  the  art  of  cross-eye! 
stereo  fusion,  but  since  a  number  of  people  in  the 
field  are  good  at  it,  we  will  present  all  pairs  of 
images  according  to  its  convention,  that  is  the 


left  view  is  shown  on  the  right  and  the  right  view 
on  the  left.  All  results  w.ll  also  be  shown  this 
way,  without  explicitly  marking  each  point  and  its 
correspondence.  We  first  started  our  experiments 
with  very  simple  line  drawings,  slightly  more  com¬ 
plex  than  the  one  shown  in  Figure  3-2  and  the 
results  matched  the  expectations.  In  order  to 
remove  the  effects  of  the  segmentation  procedure  on 
the  performance  of  our  matching  technique,  we  hand- 
segmented  the  images  shown  in  Figure  6-1  by  tracing 
the  boundaries  of  the  objects  on  a  digitizing 
table.  This  image,  from  Control  Data  Corporation, 
is  synthetic  and  has  been  used  by  Baker]  Baker82  ] 
for  his  experiments.  The  resulting  segments  are 
shown  on  Figure  6-2  and  Figure  6-3  displays  the 
results  after  matching.  All  the  lines  that  have 

been  matched  have  the  correct  correspondence,  but 
some  matches  are  missed.  This  is  due  to  the  fact 
that when the matcher gets confused by closely competing
assignments, it chooses not to assign a
label. Also, some edges are not matched because of
mistakes in the tracing procedure: we traced the
boundaries of some objects in opposite directions
in the two views.
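
The refusal to assign a label between closely competing candidates can be
sketched as follows. This is only an illustration under stated assumptions:
the function name, the convention that a higher score is better, and the
margin value are hypothetical, and are not the evaluation function used in
the experiments.

```python
def assign_label(scores, margin=0.1):
    """Return the best candidate, or None (no-match) if it is not clearly best.

    scores: dict mapping candidate segment id -> match score (assumed
    higher-is-better). margin is a hypothetical ambiguity threshold.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best_id, best_score), (_, second_score) = ranked[0], ranked[1]
    # Closely competing assignments: prefer no label over a doubtful one.
    if best_score - second_score < margin:
        return None
    return best_id


clear = assign_label({"a": 0.9, "b": 0.3})
ambiguous = assign_label({"a": 0.55, "b": 0.5})
```

With this rule, missed matches are traded for a lower rate of incorrect
ones, which is consistent with the behaviour reported above.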

For all other examples, edge detection was performed
automatically using a technique developed by
Nevatia and Babu [Nevatia80] that finds edge magnitude
and direction by convolving the image with
edge masks in different orientations (we used 5x5
masks in 6 directions here). These edges are then
linked to form boundary curves, which are approximated
by piecewise linear segments.


Next, consider the industrial part shown in
Figure 6-6; the original resolution is 256 by 256
and the gray levels are coded on 8 bits. We applied
the matching algorithm to two different
resolutions  of  the  image,  running  it  through  three 
iterations.  It  was  found  that  no  assignment  was 
changed  after  three  iterations  in  our  experiments. 
Figure  6-5  shows  the  original  edges  and  Figure 
6-6  displays  the  results  in  the  above  mentioned 
form.  Similarly,  Figure  6-7  shows  the  segments  at 
half resolution and Figure 6-8 the results. Looking
at the segments one by one, we did not notice
any spurious assignment at either resolution, meaning
that we captured the shape of the object, even
though  the  density  of  edges  is  much  larger  than  in 
the  previous  example. 


Another, more complex image is shown in Figure
6-9. In this image, we have a wide range of disparities,
a change of sign in the disparities
across the picture, various occlusions, the
presence of a repetitive structure (a Rubik’s cube)
and contrast reversal. We do not expect to get
good results with this contrast reversal since one
of our preliminary conditions is similarity in contrast,
but the other peculiarities are very interesting.
We worked at low resolution on the segments
shown in Figure 6-10 to obtain the results
shown in Figure 6-11. The interesting points are
the following:


-  The  elongated  vertical  blocks  in  the  rear 
of the image are correctly put into correspondence.




-  All  the  squares  of  the  cube  that  should 
be  identified  are  correctly  matched.  The 
correct  labeling  appeared  at  iteration  2 
(at  iteration  1,  most  of  them  are  only 
ambiguously  matched.) 

The  segments  at  high  resolution  are  shown  in  Figure 
4-12  and  the  matching  results  in  Figure  4-13.  We 
did  not  use  the  results  at  low  resolution  to  guide 
the  matching  at  high  resolution,  therefore  the 
elongated  block  in  the  rear  right  is  not  matched 
any  longer.  It  is  interesting  to  note  that  the 
edges coming from the texture of the wood blocks do
not  create  confusion,  but  help  the  matching,  on  the 
front cylinder for example. Once again, most assigned
matches are correct.


V.  Conclusions 

This research is far from being in a final
state. The initial encouraging results presented
here must therefore only be viewed as an indication
that the hypothesis of minimal differential disparity
may be useful. The critical points that
must be examined are:


- Relax the contrast constraint. This may
be done by considering not the contrast
of an edge, but the intensity values on
each side. Edges could then be matched
if either their left side or their right
side corresponds. One may eventually consider
an edge as a doublet [Baker82] and
match each side separately.

- Refine the formulation of the evaluation
formula. Statistical analysis may
yield better functions, perhaps by introducing
a static probability measure to
evaluate each match based on similarity
of intrinsic properties (length, color,
orientation). Also of concern is a more
accurate definition of a no-match label,
which is obtained if a match pair is not
clearly better than the competing ones.

- Test extensively on aerial and near-range
imagery, with terrain models for accuracy
checking.

- Finally, use an interpolation
scheme, very likely intensity-based, to
generate a full disparity map of the
scene depth.

Figure 4-1: Synthetic image [256x256x6]

Hand generated segments

Figure 4-4

Figure 4-5: Segments from the full resolution

Figure 4-6

Results at half resolution

Results at low resolution

Figure 4-12: Segments at high resolution

Results at high resolution


ABSTRACT

The  state  of  the  art  of  parallel 
processing  is  characterized  by  an 
extraordinary  proliferation  of 
architectures.  To  date,  little  work  has 
been  done  to  quantify  the  relative 
performance  capabilities  of  this  range  of 
architectures,  especially  from  the 
viewpoint  of  general  Image  Understanding 
(IU)  processing  requirements.  This  paper 
discusses  performance  evaluation  of 
parallel  hardware.  A  set  of  software 
metrics  is  proposed,  based  on  the 
processing  requirements  of  common  IU 
systems.  The  intent  of  the  paper  is  to 
serve  as  a  point  of  departure  for  further 


OVERVIEW 

The  Image  Understanding  (IU) 
community  is  faced  with  an  ever-expanding 
range  of  highly  parallel  computer 
architectures,  many  intended  to  uniquely 
match  special  processing  requirements 
within  that  discipline. A  key  issue  in 
the  development  of  such  machines  is  the 
degree to which they meet the general
needs  of  the  field.  To  date,  little 
attempt  has  been  made  to  objectively 
analyse  the  performance  characteristics  of 
these machines when applied to typical IU
problems.  The  work  that  has  been  done  in 
this  area  7,8  has  so  far  been  limited  in 
scope  to  just  a  few  architectures,  and  to 
biomedical  image  processing  application 
areas.  No  concerted  effort  has  been  made 
to  study  the  full  range  of  parallel 
architectures,  within  the  broader  context 
of  Image  Understanding  and  scene  analysis. 


*  This  work  is  supported  in  part  by  the 
Defense  Advanced  Research  Projects  Agency 
of  the  Department  of  Defense,  and  was 
monitored  by  the  Wright  Patterson  Air 
Force Base, under Contract F-33615-76-C-1203,
DARPA Order No. 3119.


A  partial  reason  for  the  lack  of  such 
comparative  analysis  is  the  absence  of  a 
set  of  widely  accepted  metrics  for 
parallel  hardware  performance  evaluation. 
Our  purpose  here  is  to  propose  a  set  of 
common  algorithms  for  use  as  performance 
evaluation  standards  within  the  field  of 
IU.  Hopefully,  the  current  paper  will 
stimulate  work  leading  to  a  set  of 
software  metrics  that  will  be  pertinent  to 
IU,  widely  accepted,  and  simple  to  apply. 


PERFORMANCE  EVALUATION  METHODS 

In  conventional  numeric  processing, 
there  are  two  commonly  used  methods  of 
performance  evaluation.  These  are  the 
instruction  mix  and  benchmark  program 
approaches.  In  the  instruction  mix 
approach,  a  set  of  programs  is  examined  to 
determine  the  number  of  times  each  type  of 
machine  instruction  occurs.  The  execution 
time  of  a  particular  processor  performing 
the  programs  in  question  can  then  be 
estimated  by  multiplying  its  execution 
time  for  each  machine  instruction  by  the 
number  of  times  that  instruction  occurred 
in  the  benchmark.  The  sum  of  all  such 
products,  one  for  each  machine 
instruction,  is  then  taken  as  the 
estimated  execution  time  for  the  program 
set  represented  by  the  instruction  mix. 
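
The instruction mix estimate described above amounts to a weighted sum. A
minimal sketch follows; the instruction names, counts, and per-instruction
times are hypothetical values chosen only to illustrate the arithmetic.

```python
def estimate_execution_time(instruction_counts, instruction_times):
    """Estimate run time as the sum over instruction types of
    (number of occurrences) * (execution time of that instruction)."""
    return sum(count * instruction_times[op]
               for op, count in instruction_counts.items())


# Hypothetical instruction mix (occurrence counts) for a program set.
mix = {"add": 5000, "multiply": 2000, "load": 3000}
# Hypothetical per-instruction execution times, in microseconds.
times_us = {"add": 1.0, "multiply": 4.0, "load": 2.0}

total_us = estimate_execution_time(mix, times_us)  # 19000.0 microseconds
```

Note that nothing in this estimate depends on where the operands come from,
which is the weakness discussed next.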

The  instruction  mix  approach  is 
useful  for  rapidly  evaluating  the 
performance  of  conventional  serial 
processors,  but  has  little  utility  for  the 
study  of  parallel  architectures.  This  is 
because  of  the  importance  of  data  movement 
in  IU  applications.  That  is,  two 
algorithms  might  show  identical 
statistics,  in  terms  of  the  numbers  of 
multiplications,  additions,  etc.  that  each 
require,  but  show  radically  different 
execution  times,  due  to  widely  differing 
data  movement  requirements.  For  example, 
one  algorithm  might  involve  only  data 
taken  from  a  relatively  small  kernel, 
while  the  other  might  require  global 
access  to  data  scattered  across  the  entire 


image  data  plane.  Furthermore,  the 
pattern  of  data  movement  is  often  at  least 
as  important  as  the  amount  of  movement 
itself,  since  different  architectures  are 
able  to  take  varying  advantage  of 
regularities  in  data  movement.  In 
addition  to  the  importance  of  data 
movement,  the  instruction  mix  approach  is 
unusable  in  IU  applications  because  of 
varying  efficiencies  in  parallel 
architectures. Many parallel processors,
particularly SIMD arrays, are not always
able to use all of their available
hardware to best advantage: situations
frequently  arise  in  which  a  significant 
portion  of  the  available  hardware  is  idle, 
due  to  the  lack  of  pertinent  image  data  in 
the  pixels  associated  with  it.  In  such 
cases,  a  simplistic  analysis  based  on 
aggregate  instruction  rates  leads  to 
performance  figures  significantly  higher 
than  can  actually  be  attained. 
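
The gap between aggregate and attainable rates can be illustrated with a
small sketch. It assumes, purely for illustration, one processing element
per pixel and a boolean mask marking the elements that hold pertinent data;
the function names and the peak rate are hypothetical.

```python
def active_fraction(mask):
    """Fraction of processing elements whose pixel holds pertinent data.

    mask: 2-D list of booleans, one entry per processing element.
    """
    total = sum(len(row) for row in mask)
    active = sum(1 for row in mask for cell in row if cell)
    return active / total


def effective_rate(peak_ops_per_sec, fraction):
    """Rate actually attained when only a fraction of the array is busy."""
    return peak_ops_per_sec * fraction


# Hypothetical 4x4 array in which only a 2x2 region contains an object.
mask = [
    [False, False, False, False],
    [False, True,  True,  False],
    [False, True,  True,  False],
    [False, False, False, False],
]
fraction = active_fraction(mask)          # 0.25
attained = effective_rate(1e9, fraction)  # one quarter of the peak rate
```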

The  benchmark  program  method  of 
performance  evaluation,  on  the  other  hand, 
avoids  many  of  the  problems  just 
mentioned.  In  this  approach,  a 
representative  algorithm  is  programmed  to 
run  on  a  particular  machine,  and  the 
actual  execution  time  is  measured.  If  the 
representative program is properly chosen,
the  results  are  virtually  guaranteed  to  be 
accurate,  since  such  matters  as  data 
movement,  machine  efficiency,  and  the 
operating  environment  are  naturally 
included  in  the  final  measurement. 
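
In its simplest form, the benchmark approach is a timing harness wrapped
around the representative program. The sketch below is an assumption about
how such a harness might look on a conventional machine; the workload
function is a hypothetical stand-in, not one of the metrics proposed here.

```python
import time


def benchmark(func, *args, repeats=5):
    """Run func(*args) several times and return the best wall-clock time
    in seconds. Taking the minimum reduces the influence of other
    activity on the machine."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def sum_of_squares(n):
    """Hypothetical stand-in for a representative program."""
    return sum(i * i for i in range(n))


seconds = benchmark(sum_of_squares, 1000)
```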

The  problem  with  the  benchmark 
program  approach,  of  course,  lies  in  the 
choice  of  the  "representative  program." 
Particularly  in  a  field  as  broad  as  IU 
currently  is,  it  would  be  a  hopeless  task 
to map every interesting algorithm onto
every  proposed  architecture.  The  problem 
of  selecting  representative  algorithms  is 
itself  complicated  by  the  breadth  of  the 
field,  the  wide  range  of  approaches  to  any 
given  IU  sub-task,  and  by  the  fact  that 
there  are  many  areas  in  which  there  is  no 
clear  consensus  as  to  the  best  algorithm 
for  performing  a  given  task.  These  are 
the  parameters  within  which  we  must  work. 
They  are  further  modified  by  a  strong 
desire  to  reduce  to  an  absolute  minimum 
the  amount  of  coding  required  to  implement 
an  evaluative  test.  Ideally,  what  we 
would  like  to  do  is  to  find  the  lowest 
level  of  program  modules  with  the  greatest 
degree  of  applicability  across  the  entire 
range  of  IU  algorithms. 

In  selecting  representative 
algorithms  or  modules  though,  we  must  be 
particularly  careful  to  not  only  represent 
the  full  range  of  application  requirements 
(such  as  feature  extraction, 
classification,  etc.),  but  to  include  as 
well  the  entire  range  of  processing 


requirements,  as  seen  by  the  hardware.  In 
other  words,  while  covering  the  entire 
range  of  IU  algorithms,  we  must  (since  our 
objective  is  hardware  evaluation)  focus 
more  on  the  processing  load  in  making  our 
selections,  rather  than  on  the  overall 
structure  of  any  particular  algorithm. 
Accordingly,  we  need  to  develop  a 
conceptual  basis,  or  taxonomy,  by  which  we 
might  determine  the  unique  processing 
requirements  of  each  algorithm  studied. 


SOFTWARE  TAXONOMIES 

Little  work  has  been  done  to  date  on 
the  classification  of  software  algorithms. 
Swain,  Siegal,  and  El-Achkar  ®  proposed  a 
six-point  classification  scheme  which 
consisted of the following categories:


•  Type:                      -  Enhancement
                              -  Extraction

•  Context Dependency:        -  Context Free
                              -  Context Dependent

•  Iteration:                 -  Single-Pass
                              -  Multi-Pass

•  Multivariacy:              -  Univariate Data
                              -  Multivariate Data

•  Time:                      -  Real-Time
                              -  Batch

•  Computational Complexity:  -  n, n log(n), etc.


Swain's  "type",  "iteration",  and 
"time"  classifications  are  self- 
explanatory.  An  example  of  a  "context- 
free"  algorithm  might  be  histogramming, 
where  the  set  of  final  values  is  solely  a 
function  of  the  values  of  the  individual 
pixels,  independent  of  any  relative 
associations  that  might  exist  between 
pixels.  An  example  of  a  "context- 
dependent"  algorithm  would  be  one 
performing  adaptive  filtering  in  which  the 
output  value  (and,  indeed,  the  structure 
of  the  algorithm  itself)  for  a  given  pixel 
would  depend  strongly  on  the  values  of  the 
pixels surrounding it. A simple grey-scale
image would fall under the category
of "univariate data", while a Landsat
multi-spectral image would be considered
"multivariate data". Finally,
"computational  complexity"  refers  to  the 
relative  dependence  of  the  execution  time 
of  an  algorithm  on  the  size  (n)  of  the 
data  being  manipulated. 

These  classification  criteria  focus 
more  on  the  use  to  which  various 
algorithms  are  put  than  on  their 


structure.  As  a  result,  little  explicit 
information  is  conveyed  regarding  the 
processing  requirements  of  the  algorithms 
so  classified.  We  propose  as  a  more 
useful  set  of  criteria  the  following: 


•  Functional Statistics

•  Local vs. Global

•  Memory Intensive vs. Computation Intensive

•  Context Dependent vs. Context Free

•  Iconic vs. Symbolic

•  Object Oriented vs. Coordinate Oriented


Here,  the  term  "functional 
statistics"  simply  refers  to  statistics  of 
the  sort  normally  used  in  compiling 
representative  instruction  mixes.  In 
particular,  we  refer  to  the  relative 
frequency  of  various  arithmetic 
operations,  such  as  addition, 
multiplication,  division,  etc.  Such 
statistics  are  important  in  evaluating  the 
performance  characteristics  of  specific 
machines,  but  are  less  valuable  in  the 
study  of  general  architectures.  Apart 
from  gains  attributable  to  the  level  of 
parallelism  employed,  arithmetic 
performance  is  more  a  function  of 
implementation  than  architecture. 

The  local/global  distinction  in  our 
classification  scheme  is  really  a  measure 
of  the  a  priori  knowledge  concerning  data 
location  contained  in  an  algorithm,  rather 
than  an  indication  of  the  size  of  the 
domain  upon  which  an  algorithm  operates. 
That  is,  we  consider  the  scope  of  an 
algorithm  to  be  "global"  if  its  domain 
cannot  be,  a  priori,  restricted  to  any 
subset of the image data. Thus, a
"global" algorithm is one which may draw
its  data  from  anywhere  in  the  image  plane, 
whether  or  not  it  does  so  in  all  cases. 
This  distinction  arises  from  the  scope  of 
the  data  access  required  by  the  individual 
processing  elements  in  a  parallel 
architecture.  If  an  algorithm  may  require 
a  processor  to  have  access  to  any  area  of 
the  image,  the  architecture  must  provide 
for  such  arbitrary  access.  From  the 
viewpoint  of  the  computer  architect,  the 
fact  that  only  a  small  number  of  pixels 
will  be  involved  in  a  given  operation 
matters  less  than  the  fact  that  those 
pixels  may  lie  anywhere  in  the  image 
plane. 

The memory/computation intensive
classification  is  a  measure  of  the  amount 


of local memory that will be needed to
successfully  execute  an  algorithm.  "Local 
memory"  is  taken  to  mean  that  memory 
associated  with  each  sub-processor  in  a 
parallel  machine.  This  memory  is  used  to 
store such things as raw image data,
convolution  coefficients,  or  intermediate 
results.  An  example  of  such  usage  would 
be  the  need  to  store  edge  magnitudes  for 
each  of  several  edge  directions,  in  most 
edge-detection  algorithms.  Operations 
based on sorting within a kernel (e.g.,
median filtering) in particular require
large  amounts  of  local  memory. 

As  with  Swain's  classification,  we 
take  context  dependency  to  mean  the  extent 
to  which  the  values  output  by  an  algorithm 
depend  on  relationships  existing  between 
input  data  elements.  Note  that  this 
definition  of  context  dependency  does  not 
refer  to  situations  in  which  the  output  of 
an  algorithm  depends  non-linearly  on  the 
input  values  (as  with  thresholding).  Such 
behavior is described by our linear/non-linear
classification. The more commonly
employed  term  of  "data  dependency"  refers 
to  both  context  dependency  and  linearity. 
We  have  chosen  to  distinguish  these  two 
cases  as  separate  classification 
parameters  because  of  their  different 
implications  for  hardware. 

Our  "iconic  vs.  symbolic" 
categorization is included because the two
representations  involve  greatly  different 
types  of  processing.  In  iconic 
processing,  there  is  a  direct  relationship 
between  physical  storage  locations  and 
image  pixels.  Symbolic  processing,  on  the 
other  hand,  involves  the  manipulation  of 
lists  and  other  data  structures  which 
contain  image  coordinates  only  as  explicit 
entries  in  the  data  structure.  Machines 
well  suited  to  iconic  processing  are 
largely  unsuited  to  symbolic  processing, 
and  vice  versa.  Most  of  the  concurrent 
architectures  proposed  to  date  have  been 
of  the  iconic  type. 

The  problem  of  iconic  vs.  symbolic 
processing  goes  beyond  the  simple 
dichotomy  of  the  classification,  however. 
Modern  image  understanding  frequently 
involves  the  translation  of  image  data 
from  the  initial,  iconic,  form  to  a 
subsequent,  symbolic  one.  A  great  deal  of 
the  processing  load  of  advanced, 
autonomous  systems  actually  occurs  on  the 
symbolic  level,  in  the  application  of 
knowledge-based  rule  systems  to  the  raw 
data  gathered  by  the  lower  levels  of  the 
vision  system.  Both  sorts  of  processing 
are  therefore  important  to  practical 
applications.  Significantly,  though, 
while  we  know  fairly  well  how  to  build 
machines  that  are  capable  of  processing 
either  iconic  or  symbolic  data,  no  current 


architectures  adequately  address  the 
problem  of  translation  between  the  two 
domains.  The  difficulty  of  this 
translation  lies  in  the  fact  that  it  is 
basically  an  object-oriented  process.  The 
data  corresponding  to  a  particular  object 
might lie anywhere within the image plane,
making  it  difficult  to  make  any  a  priori 
assignments  of  individual  processors  to 
individual  objects  in  a  multiprocessor 
architecture.  Similarly,  SIMD  machines 
can  only  translate  between  iconic  and 
symbolic  representations  one  object  at  a 
time,  due  to  their  single  instruction 
stream.  There  are  other  considerations 
involved  here  that  extend  beyond  the  scope 
of  this  paper,  but  suffice  it  to  say  that 
the  iconic/symbolic  translation  problem 
remains difficult and as yet unsolved. It
is for this reason that we have included
"object oriented vs. coordinate
oriented"  in  our  list  of  classification 
parameters.  Coordinate  oriented 
processing  refers  to  situations  in  which 
the  location  of  the  data  to  be  processed 
within  the  image  is  known  in  advance, 
independent  of  any  characteristics  of  that 
data.  On  the  other  hand,  in  object- 
oriented  processing,  the  location  of  the 
data  to  be  processed  is  an  implicit 
function  of  the  data  itself,  and  of 
relationships  existing  within  the  data. 
The  consequence  for  hardware,  as  mentioned 
earlier,  is  that  object-oriented 
processing  requires  access  to  the  entire 
image  plane.  Few  architectures  provide 
such  access  while  allowing  independent 
processing  of  various  parts  of  the  image. 

This  list  of  classification 
categories provides a basis for a study of
algorithm  characteristics,  as 
distinguished  by  the  demands  placed  on  the 
processing  hardware.  We  will  use  them 
subsequently  in  our  discussion  of  IU 
processing  requirements,  and  again  in  our 
overview  of  current  architectures. 


IU  ALGORITHM  OVERVIEW  &  METRIC  SET 

In  this  section,  we  briefly  review 
the  most  common  types  of  processing 
encountered  in  image  understanding. 
Before  launching  directly  into  this 
discussion,  though,  it  would  be 
appropriate  to  consider  the  level  of 
algorithms  that  would  be  most  profitable 
to  study.  We  would  like  to  find 
algorithms  or  operations  which  enjoy  wide 
application  across  the  entire  IU  field. 
We  must  balance  this  desire  against  the 
requirement  that  the  algorithms  selected 
be  uniquely  representative  of  the 
requirements  of  IU,  as  identified  by  our 
taxonomy.  Obviously,  the  operations  of 
addition  and  multiplication  are  widely 
used  within  IU.  By  themselves,  though, 


they  do  little  to  represent  the  unique 
requirements  of  the  discipline.  On  the 
other  hand,  the  "Smith,  Smith,  Smith,  and 
Jones"  matched  filter  for  '57  Chevys, 
while  highly  developed,  has  only  a  limited 
range  of  application. 

Accordingly,  we  have  selected  a  set 
of  "unit  operations"  which  function  at  a 
low enough level that they may be employed
by  a  wide  range  of  higher  level 
algorithms,  but  that  are  themselves  of  a 
sufficiently  high  level  to  be  classified 
according  to  our  previously  developed 
taxonomy.  Table  I  lists  the  unit 
operations  that  we  have  selected  for 
consideration,  and  shows  how  they  fit  into 
our  classification  scheme.  A  discussion 
of  the  unit  operations  and  their 
classification  follows. 

It  is  important  to  note  in  the 
following  discussion  that  many  of  the 
operations  described  can  be  used  in  ways 
contradictory  to  their  primary 
classification.  This  does  not  invalidate 
in  any  way  their  selection  as  part  of  the 
metric  set,  based  on  our  classification  of 
them.  Our  intent  here  is  not  to 
rigorously  classify  the  algorithms, 
including  all  variations  of  their  usage, 
but merely to ensure that we have
adequately  accounted  for  the  various  types 
of  processing  represented  by  our  taxonomy. 

Thresholding,  the  first  entry  in 
Table  I,  finds  broad  application 
throughout  IU.  It  was  chosen  for 
inclusion  in  the  metric  set  as  the 
simplest  example  of  a  parallel,  non-linear 
operation.  As  for  its  other  parameters  in 
the  classification  scheme,  it  is  local, 
because  thresholding  by  definition  takes 
as  input  only  the  values  of  individual 
pixels.  Thresholding  may  be  either 
context-free  or  context-dependent, 
depending  on  whether  it  is  being  done 
adaptively  or  not.  If  the  threshold  value 
is  the  same  for  all  pixels  of  the  image, 
the  operation  is  context-free.  On  the 
other  hand,  if  the  threshold  is  set 
locally,  as  some  function  of  local  data 
values,  the  operation  is  context- 
dependent.  Examples  of  both  types  of 
thresholding  would  be  good  candidates  for 
inclusion  in  the  metric  set.  Since 
thresholding  does  not  typically  involve 
the  storage  of  intermediate  results, 
little  local  memory  is  required,  and  the 
operation  is  therefore  considered  to  be 
computation-intensive.  Thresholding  is  a 
coordinate-oriented  operation,  even  in  the 
adaptive  case,  because  the  data  required 
to  generate  a  given  result  always  lies 
within  a  small  area  surrounding  the  pixel 
being  processed.  Likewise,  the  operation 
is  strictly  iconic. 
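
The two forms of thresholding discussed above can be contrasted in a short
sketch. The adaptive rule used here (comparison against the mean of a small
surrounding window) is only one possible choice of "some function of local
data values" and is our assumption; function names and parameters are
hypothetical.

```python
def global_threshold(image, t):
    """Context-free: the same threshold t is applied to every pixel."""
    return [[1 if p >= t else 0 for p in row] for row in image]


def adaptive_threshold(image, radius=1, offset=0):
    """Context-dependent: each pixel is compared with the mean of its
    (2*radius+1)-square neighborhood, clipped at the image border."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [image[rr][cc]
                      for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                      for cc in range(max(0, c - radius), min(cols, c + radius + 1))]
            local_mean = sum(window) / len(window)
            out[r][c] = 1 if image[r][c] >= local_mean + offset else 0
    return out


image = [[10, 10, 10],
         [10, 200, 10],
         [10, 10, 10]]
fixed = global_threshold(image, 100)
local = adaptive_threshold(image)
```

In both cases the data needed for a given output pixel lies in a small,
predictable area around it, which is why the operation remains
coordinate-oriented.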


Convolution,  the  second  table  entry, 
is  widely  employed  in  filtering  functions 
such  as  edge  detection,  and  as  part  of 
such  procedures  as  connectivity  linking, 
and  region  growing.  It  is  basically  a 
linear,  arithmetic  process,  in  which  the 
data  values  within  a  local  neighborhood 
are  multiplied  by  a  set  of  weight  values, 
and  the  resulting  products  are  summed  to 
produce  the  final  result.  Such  a  sum-of- 
products  is  computed  for  each  pixel  of  the 
input  image.  As  we  have  just  stated, 
convolution  is  an  example  of  a  linear, 
local  operation.  In  some  situations,  the 
output  is  made  non-linear,  but  this  occurs 
through  thresholding,  which  has  already 
been  included  in  the  metric  set.  In  its 
purest  form,  convolution  is  context-free, 
with  the  weighting  function  being 
invariant  across  the  image.  In  some  forms 
of  adaptive  filtering,  a  multi-pass 
iteration  is  applied,  with  the  local 
weighting  functions  being  modified  by  the 
results  of  the  earlier  pass.  An  example 
of  such  usage  would  be  an  algorithm  to 
extract  the  lines  forming  the  loops  and 
whorls  of  fingerprints.  In  this 
application,  the  weighting  values  of  a 
line-detecting filter are modified
according  to  the  dominant  local  line 
direction  found  on  a  previous  pass.  As 
with  thresholding,  both  adaptive  and  non- 
adaptive  forms  of  convolution  processing 
should  be  included  in  the  metric  set. 
Convolution  by  itself  is  computation- 
rather  than  memory-intensive.  Some  of  its 
applications  do  involve  the  storage  of 
intermediate  products,  however.  Edge- 
detection,  for  example,  usually  involves  a 
series  of  convolutions,  one  in  each  edge 
direction  being  tested  for,  with  the 
intermediate  results  of  each  individual 
convolution  being  stored  for  subsequent 
comparison  and  selection  of  the  largest 
directional  value  at  each  point.  We 
consider  this  to  be  an  example  of  sorting, 
though,  which  we  treat  separately  as  the 
third  entry  of  the  table.  Adaptive 
filtering  can  also  involve  substantial 
amounts  of  local  memory,  depending  on  the 
architecture  of  the  machine  being  used. 
In  some  machines,  particularly  cellular 
arrays,  the  various  weighting  coefficients 
for each of a range of possible
convolutions  are  all  stored  in  the 
machine's  local  memory.  This  is  the 
primary  incentive  for  including  adaptive 
convolution  in  the  metric  set.  As  to  its 
other  classification  parameters, 
convolution  is  strictly  coordinate 
oriented  and  operates  within  an  iconic 
representation. 
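
The sum-of-products described above can be sketched as follows for the
non-adaptive case. The weights are applied as written in the text, without
the kernel reversal of the strict mathematical definition, and only output
positions where the kernel fits entirely inside the image are computed;
both choices, and the function name, are assumptions made for brevity.

```python
def convolve_valid(image, weights):
    """For each position where the kernel fits, multiply the neighborhood
    by the weight values and sum the products."""
    kh, kw = len(weights), len(weights[0])
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - kh + 1):
        out_row = []
        for c in range(cols - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * weights[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box = convolve_valid(image, [[1, 1, 1], [1, 1, 1], [1, 1, 1]])
diagonal_difference = convolve_valid(image, [[1, 0], [0, -1]])
```

Because the same weights are used at every position, the data access
pattern is fixed in advance, which is what makes the operation
coordinate-oriented.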

Sorting,  the  third  entry  in  Table  I, 
is included as an example of a local, non-linear,
memory-intensive process. As just
mentioned,  it  finds  application  in 
operations  such  as  line-finding,  where  the 


largest  of  a  set  of  several  values  must  be 
selected. Median filtering likewise
involves the selection of the median value
from among a number of data values
occurring in a local neighborhood.
(Median filtering is most often used for
size discrimination and connectivity
processing.) Sorting is best thought of
as  context-dependent,  since  the  shuffling 
of  pieces  of  data  or  the  setting  of 
pointers  is  strictly  a  function  of  the 
data  values  themselves.  It  is  also 
coordinate-oriented,  but  can  occur  in 
either  an  iconic  or  symbolic 
representation.
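
Median filtering, the sorting example named above, can be sketched as
follows. Border pixels are simply copied in this illustration, and the
function name and radius parameter are hypothetical.

```python
def median_filter(image, radius=1):
    """Replace each interior pixel by the median of its
    (2*radius+1)-square neighborhood; border pixels are left unchanged."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(radius, rows - radius):
        for c in range(radius, cols - radius):
            # The whole neighborhood must be held and sorted, which is
            # the source of the local memory requirement.
            window = sorted(image[rr][cc]
                            for rr in range(r - radius, r + radius + 1)
                            for cc in range(c - radius, c + radius + 1))
            out[r][c] = window[len(window) // 2]
    return out


noisy = [[10, 10, 10],
         [10, 255, 10],
         [10, 10, 10]]
cleaned = median_filter(noisy)
```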

Histogramming  is  our  fourth  candidate 
for  inclusion  in  a  set  of  software 
metrics.  Histogramming  is  a 
representative  of  what  might  be  called 
"statistical  processing,"  and  constitutes 
a  large  portion  of  the  computational  load 
of  the  popular  region-splitting 
segmentation  algorithms.  Its  processing 
requirements  differ  from  those  of  the 
operations  detailed  so  far,  in  that  it 
operates  globally  in  a  context-free 
fashion.  It  is  most  properly  thought  of 
as  computation-intensive,  since  memory  is 
only  required  for  the  storage  of  the  final 
tallies.  On  the  other  hand,  on  MIMD  and 
pipelined  machines,  its  actual  execution 
is memory access-intensive, in that a
small set of memory locations is accessed
repeatedly  as  the  individual  tallies  are 
updated.  Histogram  computation  is  also 
linear  with  respect  to  the  input  values, 
and  is  most  often  employed  in  an  iconic 
context.  While  we  have  classified 
histogram  processing  as  being  coordinate- 
oriented,  it  could  be  argued  that  many 
applications use it in an object-oriented
manner.  An  example  of  such  usage  would  be 
situations  in  which  histograms  are 
generated  for  a  class  of  objects  within 
the  image,  as  is  common  in  target- 
identification  routines.  This  object 
dependency  is  not  an  intrinsic 
characteristic  of  the  histogramming 
process,  though,  but  rather  an  extra, 
externally-applied  condition.  Hence  our 
"coordinate-oriented"  classification. 

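A minimal serial sketch of histogramming is given below. The function name
and the number of grey levels are assumptions for illustration; the point
is that the only storage needed is the final tally array, which is
nevertheless updated once per pixel.

```python
def histogram(image, levels=256):
    """Count how many pixels take each grey level in 0..levels-1.

    The result depends only on individual pixel values (context-free),
    but every pixel of the image contributes (global)."""
    counts = [0] * levels
    for row in image:
        for p in row:
            counts[p] += 1  # repeated access to a small set of locations
    return counts


image = [[0, 1, 1],
         [2, 2, 2]]
tallies = histogram(image, levels=4)
```
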
Correlation  operations  were  selected 
as  the  fifth  metric  set  candidate  because 
they  involve  local  processing  with  a 
higher  level  of  context-dependency  than  we 
have  previously  encountered  in  this 
discussion.  As  typically  applied  (in 
stereo  processing),  portions  of  one 
picture  are  compared  against  various 
regions  of  another  picture.  The 
comparison  process  is  basically  a 
convolution,  but  the  weighting  functions 
are  the  data  values  of  the  reference 
image.  Correlation  thus  tests  an 
architecture's  ability  to  rapidly  access 
different sets of weighting values for
convolution processing. This sort of
operation  is  common  in  both  feature-  and 
intensity-based  stereo  processing.  In 
most  implementations,  it  is  also  somewhat 
object-oriented,  in  that  the  correlations 
are  performed  only  in  areas  containing 
features  of  interest.  These  areas  of 
interest  may  lie  anywhere  in  the  image 
plane,  and  so  could  be  thought  of  as 
fitting  the  description  of  "object- 
oriented."  On  the  other  hand,  this 
"object  orientation"  is  usually  an  attempt 
to  reduce  the  computational  load  on 
conventional  serial  computers.  The 
intrinsic  structure  of  the  processing 
implies  no  object  orientation,  and  so  we 
classify it as coordinate-oriented.
Correlation  is  also  a  linear  process,  and 
usually  is  performed  in  the  iconic  domain. 
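
The correlation step described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular stereo system: the function names and the normalization by window energy are assumptions added here so that bright regions do not dominate the score. The essential point from the text is that the pixel values of the reference patch serve as the convolution weights.

```python
from math import sqrt

def window_score(reference, search, top, left):
    """Normalized correlation of the reference patch with one search window."""
    products = 0.0
    window_energy = 0.0
    reference_energy = 0.0
    for r, ref_row in enumerate(reference):
        for c, weight in enumerate(ref_row):
            value = search[top + r][left + c]
            products += weight * value       # reference pixels act as the weights
            window_energy += value * value
            reference_energy += weight * weight
    if window_energy == 0 or reference_energy == 0:
        return 0.0                           # flat window: no usable match
    return products / sqrt(window_energy * reference_energy)

def best_match(reference, search):
    """Return (score, top, left) for the best-matching window position."""
    ref_h, ref_w = len(reference), len(reference[0])
    best = (-1.0, 0, 0)
    for top in range(len(search) - ref_h + 1):
        for left in range(len(search[0]) - ref_w + 1):
            score = window_score(reference, search, top, left)
            if score > best[0]:
                best = (score, top, left)
    return best
```

Each candidate position requires a fresh set of weights to be combined with a different window, which is the access pattern the metric is meant to stress.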

Interior  point  selection,  our  sixth 
candidate  for  the  metric  set,  is  the 
process  of  identifying  those  points  of  the 
image  that  lie  within  closed  boundaries 
defined  by  previously  located  edge 
segments. A point is determined to be on
the interior of a closed boundary if
there are edge segments within a certain
radius of it in a majority of the
directions  checked.  Since  a  point  is 
either  inside  an  object  or  not,  the 
processing  is  non-linear,  and  since  the 
data may lie anywhere within the image
plane,  the  processing  is  global.  It  is 
furthermore  both  context-dependent  and 
object-oriented,  according  to  the 
definitions  given  earlier.  Finally,  it 
most  naturally  operates  on  iconically 
represented  data,  and  is  computation 
intensive,  in  that  little  intermediate 
data  is  stored  for  each  point  evaluated. 
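
The interior-point test can be illustrated with a short sketch. The choice of eight compass directions, the representation of edge segments as a set of (row, column) pixels, and the function name are assumptions made for illustration; the text specifies only the radius and the majority rule.

```python
# Eight compass directions as (row step, column step); an assumed choice.
DIRECTIONS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def is_interior(point, edges, radius):
    """Return True if edge pixels lie within `radius` in a majority of directions."""
    row, col = point
    hits = 0
    for d_row, d_col in DIRECTIONS:
        # Walk outward along this direction until an edge pixel is met.
        for step in range(1, radius + 1):
            if (row + d_row * step, col + d_col * step) in edges:
                hits += 1
                break
    return hits > len(DIRECTIONS) / 2
```

The result for each point is a yes-or-no decision that depends on edge data lying anywhere within the search radius, which is why the text classifies the operation as non-linear and global.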

Line  finding  is  the  seventh  metric 
candidate  we  have  considered.  By  "line 
finding,"  we  mean  those  routines  which  are 
concerned  with  linking  edge  segments 
together  into  lines,  and  "tracing"  the 
resulting  lines  to  determine  their  lengths 
and  orientations.  This  is  the  most 
clearly  object-oriented  process  that  we 
have  considered  so  far,  in  that  the 
"tracing"  operation  necessarily  involves 
following  the  line  wherever  on  the  image 
plane  that  it  might  go.  The  process  is 
particularly  interesting,  because  it 
operates  on  iconically  represented  data  to 
produce  symbolic  data.  As  mentioned 
earlier,  and  as  we  shall  see  subsequently, 
in  the  discussion  of  machine  architecture 
characteristics,  such  translation 
processes  pose  particularly  difficult 
problems to computer architects. On array
machines, the process is computation-intensive
according to our earlier
definition, because the generated line
data remains distributed across the
memories of a number of the cellular
processors. On the other hand, on MIMD
machines (which are generally better
suited to this sort of processing), the
process is memory-intensive, requiring
large amounts of local memory for its
efficient execution. As to other
classification categories, the operation
is non-linear, global, and context-dependent.

Shape  descriptions  are  the  eighth 
member  of  our  metric  set.  Shape 
description,  like  line-finding,  involves 
translating  information  from  an  iconic  to 
a  symbolic  representation.  As  such,  it  is 
a  strongly  object-oriented  process, 
involving  step-by-step  tracing  of 
boundaries,  or  repeated  computation  of 
individual  line-segment  midpoints. 
Representative algorithms in this category
include  invariant  moment  calculations, 
medial  axis  transforms,  and  generalized 
cones.  Shape  description  algorithms  are 
usually  global,  context-dependent,  and 
computation-intensive, according to our
classification  scheme.  They  are  also 
linear,  in  that  their  output  depends 
linearly  on  the  shape  characteristics  of 
the  input  structure. 

Our  two  final  entries  in  the  proposed 
metric  set  are  examples  of  more  purely 
symbolic  processing.  Graph  matching,  the 
first  of  our  symbolic  metrics,  involves 
searching  a  graph  for  a  sub-graph  having  a 
particular,  specified  structure. 
Prediction,  the  final  metric  candidate, 
involves  the  application  of  rules  to  a  set 
of  existing  data  to  predict  the 
probability of occurrence of some
particular  condition. 

Symbolic  processing  does  not  fit 
easily  into  our  classification  scheme,  in 
that  virtually  all  such  computation  is 
described  similarly  within  the  taxonomy. 
An  extension  to  the  classification  method 
is doubtless in order. A paper10 by
Hillis indicates four possible categories
which might be used to describe symbolic
processing. Hillis' list of critical
operations  for  symbolic  computation  may  be 
paraphrased  as  follows: 


•  Deduction  of  facts  from  semantic 
inheritance  networks. 

•  Matching  of  patterns  against  sets  of 
assertions,  demons,  or  productions. 
Best  matches  must  be  selected  in  the 
absence  of  a  perfect  match. 

•  Sorting  of  sets  according  to  chosen 
parameters.

•  Searching  graphs  for  sub-graphs  with 
a  specified  structure. 


Our  proposed  graph  matching  metric 
directly addresses the fourth of Hillis'
categories, while our prediction metric is
more generally directed at the full range
of  processes. 


HARDWARE ANALYSIS

Having  established  a  basis  for 
understanding  the  processing  requirements 
of  IU,  we  now  turn  to  an  evaluation  of 
parallel hardware. Here, we examine
several  representative  classes  of 
architectures,  from  the  standpoint  of 
their  abilities  to  perform  the  various 
types  of  processing  described  by  our 
software  taxonomy.  Table  II  is  a  matrix 
showing  the  relative  ability  of  each  of 
those  architectures  to  perform  each  of  the 
classes  of  processing  discussed  earlier. 
In  the  matrix,  processors  are  evaluated  on 
a five-point scale, ranging from "++", for
a machine that is highly suited to the
type of processing represented by that
column of the matrix, to "--", for an
architecture that is highly unsuited to
that type of processing.

The  table  lists  eight  processor 
categories: cellular numeric, pipelined,
MIMD,  number  theoretic,  systolic, 
"broadcast",  data-driven,  and  associative. 
There  can  be  some  overlap  between  these 
categories, but the principal
characteristics  of  each  class  are 
sufficiently  distinct  to  permit 
discussion.  In  the  following,  we  shall 
describe  each  architecture  briefly,  and 
examine  how  well  it  meets  the  processing 
requirements  represented  by  our 
previously-developed  software  taxonomy. 

Cellular  numeric  machines  are  those 
composed  of  an  array  of  identical 
processors, or cells, directed by a common
instruction  stream,  but  operating  on 
separate  data.  The  processor  array  is 
usually  the  same  size  as  the  input  image 
data  array,  with  one  processing  cell 
assigned  to  each  pixel  of  the  input  image. 
Cellular  machines  are  perhaps  the  most 
popular  of  all  the  classes  listed,  with 
many already built or planned.11-15 Their
hardware  advantages  are  great  parallelism, 
high  regularity,  and  simple  circuitry  in 
each  of  the  cells.  These  attributes 
combine  to  make  for  relatively  simple 
design  and  VLSI  layout.  Cellular  machines 
typically employ nearest-neighbor
communication  between  array  members, 
although  some  provision  is  usually  made 
for  rapidly  propagating  or  "broadcasting" 
data  values  across  the  array  as  a  whole. 
Local  memory  for  each  processing  cell  is 
usually  fairly  limited. 


From  a  software  standpoint,  because 
of  their  great  parallelism,  cellular 
machines  are  excellent  for  local,  linear 
processing,  such  as  convolution.  They  are 
somewhat less efficient at non-linear or
context-dependent  processing  due  to  their 
single  instruction  stream.  Cellular 
arrays  usually  execute  context-dependent 
algorithms  through  an  exhaustion  process, 
generating  all  possible  results,  and  then 
selecting  the  most  appropriate  for  each 
pixel  through  a  thresholding  operation. 
Due  to  their  limited  local  memory 
capacity,  they  are  also  less  well  suited 
to  memory-intensive  algorithms  such  as 
sorting,  which  requires  the  storage  of  a 
number  of  pieces  of  data  at  each  array 
location.  Their  next-neighbor 
communication  scheme  can  also  be  limiting 
in  algorithms  requiring  a  great  deal  of 
data  sharing  across  a  large  area.  A 
global-broadcast  capability  does  permit 
excellent  performance  at  such  tasks  as 
histogramming, where a large number of
values  must  be  compared  with  all  data  in 
the  array.  Most  cellular  machines  to  date 
are  structured  for  iconic  data  processing, 
and  are  rather  unsuited  for  symbolic 
processing.  Cellular  machines  also  have 
difficulty  performing  object-oriented 
tasks.  Due  to  their  single  instruction 
stream  limitation,  they  waste  most  of 
their  parallelism  in  such  cases,  having  to 
deal  with  only  a  single  object  at  a  time. 
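
The "exhaustion process" mentioned above can be sketched as follows. The two per-pixel operations (halving bright pixels, doubling dark ones) and the function name are hypothetical, chosen only to show the pattern: because all cells share one instruction stream, every cell computes every candidate result, and a thresholding step then selects the one that applies at each pixel.

```python
def simd_select(pixels, threshold):
    """Emulate a single-instruction-stream array handling a data-dependent choice."""
    # Step 1: every cell computes the result for the "bright" case.
    bright_result = [p // 2 for p in pixels]
    # Step 2: every cell computes the result for the "dark" case.
    dark_result = [p * 2 for p in pixels]
    # Step 3: a thresholding operation yields a per-cell selection mask.
    mask = [p > threshold for p in pixels]
    # Step 4: each cell keeps whichever result its own mask bit selects.
    return [b if m else d for b, d, m in zip(bright_result, dark_result, mask)]
```

Every cell spends time on both branches even though it keeps only one result, which is the inefficiency the text attributes to the single instruction stream.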

Pipelined  machines  are  ones  in  which 
data  is  operated  on  by  a  chain,  or  "pipe" 
of  hardware  units,  each  performing  a 
separate  function.  As  each  element  of  the 
chain  finishes  its  operation,  it  passes 
its  results  to  the  next  in  line,  and 
accepts  a  new  piece  of  data  from  the 
previous  unit.  Processing  occurs 
simultaneously  in  all  elements  of  the 
chain,  with  data  flowing  down  the  chain  as 
liquid  down  a  pipe.  Concurrency  is 
obtained  through  having  a  number  of 
operations  take  place  simultaneously  along 
the  pipe,  and  sometimes  by  having  a  number 
of  pipes  operating  simultaneously. 
Pipelining is fairly common in commercial
"array processors." Several research
machines using this architecture have
been built for use in IU.16,17

Since  they  rely  upon  fixed  sequences 
of  operations  within  algorithms, 
pipelines  exhibit  high  performance  in 
situations  in  which  the  type  and  sequence 
of  operations  to  be  performed  is  rigidly 
determined.  This  is  the  case  in  many 
lower-level  algorithms,  such  as 
thresholding  and  convolution.  The 
penalty paid for the speed thus obtained
is that there is a fixed delay between the
time data is first presented at the
beginning of the "pipe" and the time when
results  are  first  available  at  the  end. 


This  delay  corresponds  to  the  amount  of 
time  it  takes  for  the  data  to  pass  through 
all  the  functional  units  along  the  pipe. 
As a consequence, a heavy time penalty is
usually paid for any data-dependent
program branching. This is because, each
time a branch is taken, the pipe must be
"filled" with new data before the new
results become available. For this
reason, pipelines perform less well on
non-linear, context-dependent, and object-oriented
algorithms. Machines with
multiple pipes can also be limited by
memory contention problems, if the pipes
share a common memory. (Hence the
"average" rating for pipelines under the
"global" and "coordinate-oriented"
headings in Table II.) Pipelines are also
unsuited for symbolic processing, due to
the high degree of context-dependency
involved.
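
The fill delay described above can be made concrete with a small clocked simulation. This is a simplified sketch under stated assumptions: each stage takes exactly one cycle, stages are plain functions, and the names `run_pipeline` and `EMPTY` are introduced here for illustration only.

```python
EMPTY = object()  # marks a pipeline latch that currently holds no data

def run_pipeline(stages, inputs):
    """Clock inputs through the stages; return (outputs, cycles_used)."""
    if not stages:
        raise ValueError("at least one stage is required")
    latches = [EMPTY] * len(stages)
    pending = list(inputs)
    outputs = []
    cycles = 0
    while pending or any(latch is not EMPTY for latch in latches[:-1]):
        # Update from the last stage backwards so that each stage reads the
        # value its predecessor produced on the previous cycle.
        for i in range(len(stages) - 1, -1, -1):
            if i == 0:
                incoming = pending.pop(0) if pending else EMPTY
            else:
                incoming = latches[i - 1]
            latches[i] = stages[i](incoming) if incoming is not EMPTY else EMPTY
        cycles += 1
        if latches[-1] is not EMPTY:
            outputs.append(latches[-1])
    return outputs, cycles
```

Under this model, n inputs passing through k stages take k + n - 1 cycles, and the first result appears only after k cycles. A data-dependent branch that discards the contents of the pipe incurs that fill delay again.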

MIMD  stands  for  "Multiple 
Instruction,  Multiple  Data,"  and  refers  to 
machines  in  which  a  number  of  largely 
independent  processors  are  harnessed  to 
process  data  in  parallel.  Each  of  the 
processors in an MIMD architecture executes
its own instruction stream. A
significant  factor  motivating  the 
development  of  such  machines  is  the  wide 
availability  of  cheap,  general-purpose 
microcomputer  chips.  (Computer  architects 
are irresistibly led to ponder the power
of a thousand, or better yet, a million Z-80
chips working in parallel.) Several
such machines have been built for use in
IU, and are presently being studied.18,19

Since  the  individual  processors  in 
MIMD  machines  are  usually  general-purpose 
ones,  with  large  memory  spaces  and  local 
program  storage,  the  architectures  as  a 
class handle context-dependent and memory-intensive
processing with greater ease
than  do  cellular  and  pipelined  systems. 
They  also  perform  object-oriented 
processing  better  than  either  of  these 
architectures,  due  to  their  large  numbers 
of  relatively  autonomous  processors.  The 
principal drawbacks of MIMD machines are
the  difficulties  of  coordinating  the 
operation  of  so  many  asynchronous 
processors  (interprocessor 
communications),  and  the  problem  of 
partitioning  the  image  data  across  the 
available  processor  memories. 
Communication  problems  arise  when 
processors  must  access  data  from  other 
processors'  memories,  and  this  accounts 
for  the  "average"  ratings  for  MIMD 
machines  in  the  "global,"  "object- 
oriented,"  and  "coordinate-oriented" 
categories in Table II.

Number  theoretic  and  systolic 
processors are two special categories of
machines particularly suited to linear,
context-free, computation-intensive
processing.  They  are  thus  highly  suited 
to  the  lowest  and  most  computation- 
intensive  levels  of  image  understanding, 
but  have  little  application  for  higher 
levels  involving  context-dependent, 
object-oriented,  or  symbolic  processing. 
Number  theoretic  processors  employ  the 
residue  number  system,  studied  extensively 
by Szabo and Tanaka,20 in which integers
are represented by their "residues" with
respect to each of a set of relatively
prime numbers called "moduli." The residue
of x mod m is the least non-negative integer
remainder of the division of x by m, where
x is the number to be converted, and m is
one of the moduli. If M is the product of
the moduli used in such a system, integers
in the range 0 ≤ x ≤ (M-1) can be
uniquely represented by their sets of
residues. The utility of the residue
system lies in the manner in which
arithmetic operations are performed.
Essentially, all operations are done in
parallel, mod m_i, for each of the moduli.
After all required operations have been
performed, the residue representation of
the result is converted back into a binary
form. The fact that the processing in
residue form is all done modulo m_i, where
the m_i are the moduli, means that
multiplications can be reduced to table
lookups, without requiring prohibitively
large ROMs. Consequently, computation-intensive
tasks can be performed extremely
rapidly. A processor for operating on 5 x
5  kernels  has  been  successfully  built  and 
tested  using  this  approach.21  The 
drawbacks  to  the  technique  are  that 
operations  such  as  thresholding  are 
difficult  to  perform  in  the  residue 
domain,  and  the  overhead  from  converting 
from  binary  to  residue  formats  and  back 
again  makes  the  technique  less 
advantageous  for  situations  in  which 
relatively  little  processing  is  to  be  done 
in  residue  form. 
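
The residue arithmetic described above can be sketched in a few lines. The moduli (7, 11, 13) are an arbitrary illustrative choice, not those of the processor cited, and the function names are assumptions. The small per-modulus tables stand in for the ROMs mentioned in the text, and the conversion back to ordinary integers uses the Chinese remainder theorem.

```python
from math import prod

MODULI = (7, 11, 13)   # hypothetical moduli, pairwise relatively prime
M = prod(MODULI)       # 1001: the integers 0..1000 are uniquely representable

# One small multiplication table per modulus plays the role of a ROM.
MULT_TABLES = {
    m: [[(a * b) % m for b in range(m)] for a in range(m)] for m in MODULI
}

def to_residues(x):
    """Convert an integer to its tuple of residues, one per modulus."""
    return tuple(x % m for m in MODULI)

def multiply(a, b):
    """Multiply two residue tuples; each channel is an independent lookup."""
    return tuple(MULT_TABLES[m][ra][rb] for ra, rb, m in zip(a, b, MODULI))

def add(a, b):
    """Add two residue tuples channel by channel."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, MODULI))

def from_residues(residues):
    """Convert back to an ordinary integer (Chinese remainder theorem)."""
    total = 0
    for r, m in zip(residues, MODULI):
        partial = M // m
        inverse = pow(partial, -1, m)   # modular inverse, Python 3.8+
        total += r * partial * inverse
    return total % M
```

With these moduli the three tables hold 49 + 121 + 169 = 339 entries in total, and no channel needs to communicate with another until the final conversion. Results are correct only while they stay within the range 0 to M-1.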

Systolic  processors  are  a  class  of 
array  machines  exhibiting  extreme 
regularity  of  hardware.  Their  name 
derives  from  the  fact  that  processing 
within  the  array  occurs  in  the  form  of 
"waves,"  moving  from  one  edge  to  the 
other.  Each  processing  element  in  the 
array  takes  data  in  from  some  of  its 
neighbors  on  one  cycle,  processes  it,  and 
passes  the  results  on  to  its  other 
neighbors  a  cycle  or  two  later.  The  high 
regularity  of  the  hardware  and  data 
movement  within  such  arrays  makes  them 
ideal  candidates  for  VLSI  implementation, 
and  the  high  degree  of  parallelism 
attainable  results  in  very  high  processing 
throughputs.  Unfortunately,  they  are 
rather weak in the areas of context-dependent,
object-oriented, memory-intensive,
and symbolic processing, and so
are largely unsuited for the higher levels
of  IU  processing.  Several  machines  have 
been  built,  however,  and  have  demonstrated 
excellent  performance  in  the  domains  of 
their greatest utility.22,23

We  call  "broadcast"  machines  those  in 
which  individual  processors  communicate 
with  each  other  through  some  sort  of 
"message  net,"  in  which  data  paths  are 
established  through  the  inclusion  of 
routing  tags  appended  to  the  data  being 
transmitted.  Architectures  of  this  sort 
have  been  primarily  developed  for  symbolic 
processing.  The  overhead  associated  with 
their  communication  method  makes  them 
largely  unsuited  for  computation-intensive 
iconic  processing,  in  which  the 
flexibility  of  the  broadcast  communication 
scheme  is  not  required.  Some  variation  of 
the  communication  technique  may  prove 
useful  for  the  difficult  task  of  object- 
oriented  iconic  to  symbolic  translation. 
A  machine  using  a  "broadcast"  architecture 
is currently being constructed at MIT10
for processing semantic nets.

Data-driven  processors  are  machines 
composed  of  a  number  of  execution  units, 
each  of  which  performs  some  simple,  low- 
level  function.  These  units  are 
interconnected  according  to  requirements 
of the program being implemented. That
is, if, for instance, a multiplication
involves the results of two earlier
additions, the outputs of two addition
execution units would be connected to the
inputs of a multiplication unit. The
execution  units  function  independently  and 
asynchronously.  Whenever  a  unit  has  all 
of  the  operands  it  needs  to  generate  a 
result,  it  "fires,"  and  passes  its  results 
to  any  other  units  to  which  it  might  be 
connected. Dennis and Misunas24 have done
much to popularize data-flow architectures
(the more popular term describing this
class).
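
The firing rule described above can be illustrated with a toy simulation of the multiplication example, (a + b) * (c + d). The `Unit` class, its port numbering, and the polling loop in `run` are assumptions made for illustration; real data-flow hardware fires units asynchronously rather than by scanning a list.

```python
import operator

class Unit:
    """An execution unit that fires once all of its operands have arrived."""
    def __init__(self, name, func, num_inputs):
        self.name = name
        self.func = func
        self.operands = [None] * num_inputs
        self.targets = []                    # (unit, port) pairs fed by this unit

    def connect(self, unit, port):
        self.targets.append((unit, port))

    def ready(self):
        return all(value is not None for value in self.operands)

def run(units, initial_tokens):
    """Deliver initial tokens, then fire ready units until nothing can fire."""
    for unit, port, value in initial_tokens:
        unit.operands[port] = value
    fired, results = [], {}
    progress = True
    while progress:
        progress = False
        for unit in units:
            if unit.ready():
                result = unit.func(*unit.operands)
                unit.operands = [None] * len(unit.operands)  # consume the tokens
                results[unit.name] = result
                fired.append(unit.name)
                for target, port in unit.targets:
                    target.operands[port] = result           # pass result onward
                progress = True
    return fired, results
```

The order in which units appear in the list does not determine when they fire; only the availability of operands does. No central instruction stream sequences the additions and the multiplication.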

Data-driven  machines  promise  to 
achieve  very  high  throughputs  for 
computation-intensive numerical
processing,  because  they  eliminate  the 
instruction  bottleneck  common  to  so  many 
other  architectures.  Their  limitations 
for  use  in  IU,  though,  include  limited 
amounts  of  intermediate  memory,  and 
reduced performance for highly context-dependent
processing. This last arises from
the  overhead  associated  with  changing  the 
configuration  of  the  processing  elements 
in  response  to  rapidly  changing  program 
requirements.  A  means  of  avoiding  this 
restriction has been implemented25 by
expressing  the  execution  units  in  the 
software  of  an  MIMD  machine.  In  this 
variation,  the  "execution  units"  are 
instructed  cycle  by  cycle  what  operations 
they are to perform. This greatly
increases the flexibility of the resulting
machine, but creates a new bottleneck in
the hardware responsible for task
assignment.

Associative  processors,  our  last 
architecture category, are really more of
an  attribute  than  a  separate  class  of 
architectures.  Many  of  the  architectural 
classes  discussed  earlier,  such  as 
cellular numeric or MIMD, may be given an
associative  capability  simply  through  the 
addition  of  the  appropriate  hardware. 
Architectures  incorporating  associative 
capability  usually  gain  their  associative 
power  at  the  expense  of  numeric  processing 
capacity.  This  accounts  for  the  low  or 
"average"  ratings  of  associative  machines 
for most categories in Table II. The
strongest  point  in  favor  of  associative 
machines  (in  the  present  context,  at 
least)  is  the  exceptional  symbolic 
processing  performance  of  which  they  are 
capable.  They  derive  this  performance 
from the fact that they may search sets or
graphs  for  particular  elements  or  nodes  in 
parallel,  for  all  members  of  the  structure 
being  scanned.  STARAN26  is  probably  the 
best-known  associative  machine  constructed 
to  date. 


CONCLUSIONS 

In  the  foregoing,  we  have  identified 
six  sets  of  software  characteristics  which 
have  particular  relevance  to  the  type  of 
processing  demanded  by  the  algorithms  so 
described.  We  have  measured  the 
capabilities  of  a  range  of  machine 
architectures  against  the  processing 
requirements  implied  by  each  of  these 
software  characteristics,  and  have 
tabulated  the  results.  Perhaps  the  most 
valid  conclusion  to  be  drawn  from  this 
analysis  is  that  no  single  architecture  is 
capable  of  performing  all  classes  of 
processing  equally  well.  More  significant 
though, is the fact that, while machines
exist  that  are  well  suited  for  either 
iconic  or  symbolic  processing,  no  present 
architecture  is  efficient  at  the  task  of 
translating  iconically  represented  data 
into  a  symbolic  representation.  This  is 
of  key  importance  for  future  real-time 
knowledge-based  IU  systems,  in  that  they 
must  be  able  to  rapidly  and  effectively 
process  iconic  data,  and  subsequently 
input  that  data  to  the  knowledge-based 
portions  of  their  structure.  The  problem 
to  be  solved  is  one  of  concurrently 
processing  object-oriented  data,  without 
running  afoul  of  communication  or  memory- 
access  bottlenecks.  The  form  of  the 
ultimate  solution  to  this  problem  is  the 
subject of active investigation by many
researchers.  With  the  present  development 
in  the  low-level  numeric  architectures, 
and the current activity in symbolic
processors,  this  topic  is  an  area  of 
fruitful  research.27 


1. Introduction


This paper describes software evaluation methods developed
at SRI International to evaluate contributions to the
DARPA/DMA Image Understanding (IU) Testbed. Examples
of evaluation results are also presented.


The  primary  purpose  of  the  IU  Testbed  is  to  provide 
a means for transferring technology from the DARPA-sponsored
IU research program to DMA and to other
organizations  in  the  defense  community.  The  approach 
taken  to  achieve  this  purpose  has  two  components: 


• The establishment of a uniform environment as compatible
as practical with the environments of research
centers at universities participating in the IU research
program. Thus, organizations obtaining copies of the
Testbed can receive a continuing flow of new results
derived from on-going research.

• The acquisition, integration, testing, and evaluation
of selected scene analysis techniques that represent
mature examples of generic areas of research activity.
These contributions from participants in the IU research
program will allow organizations with Testbed
copies to begin the immediate exploration of applications
of IU technology to problems in automated
cartography and other areas of scene analysis.


Evaluation of contributed scene analysis techniques has
thus been a major thrust of the Testbed effort. Development
of the evaluation methodology has been a related goal.
Software evaluation is difficult, and few independent evaluations
of IU software have been published. Analysis of
an algorithm alone, even if feasible, would neither guarantee
correct implementation nor quantify performance on
realistic problems. Simple tabulations of pixel classification
errors (as in Yasnoff et al. [1]) would not be meaningful for
complex scene analysis tasks. Comparative evaluation using
several algorithms or software packages on one set of
test scenes (as in Ranade and Prewitt [2]) was not practical
for testing single algorithms. We have chosen a more subjective
approach based on: (1) careful analysis, (2) tests
on simple and complex natural scenes, and (3) our own
experience in image analysis. This is similar to the method
advocated by Nagin et al. [3].


In this paper I describe my experiences with the initial
software evaluation efforts on the IU Testbed. I was specifically
involved with the evaluation of the GHOUGH object
detection system [4] from the University of Rochester, the
PHOENIX segmentation system [5] from Carnegie-Mellon
University (CMU), and the RELAX relaxation package [6]
from the University of Maryland. Many other software
packages have been contributed to the Testbed, but have
not been as extensively evaluated.


2.  Evaluation  Purpose 

There  are  many  reasons  for  evaluating  software  packages. 
Managers,  systems  personnel,  and  users  all  have  different 
perspectives  and  different  requirements.  These  imply  many 
different questions that must be answered by a thorough
evaluation  effort.  Some  of  the  major  questions  are: 

• Acquisition — Should the software package be acquired
and further evaluated for local applications?
What are its capabilities? Can it be extended?

• Implementation — What operating system support
is required? How much memory does the package
need? How much time does it take to run? Does
the implementation correspond to the documented
algorithm? Does performance match theoretical predictions?
How well is the code structured and commented?
Is the documentation adequate?

• Application — Is the package suitable for a particular
application? Is the user interface adequate? How does
the package perform? Can it be integrated with other
packages?

We have attempted to answer these questions in our evaluation
reports. The first section of each report introduces
the package at a management level, answering questions
about the tasks for which the algorithm is suited. Subsequent
sections are written for system implementers and for
users. The final sections document performance on evaluation
tasks and make suggestions for future improvements.

An  evaluation  effort  may  have  subsidiary  effects  on  the 
software, the Testbed, and the personnel involved:

• Adaptation — The evaluation effort has spurred several
authors to polish or document their software before
releasing it to the IU Testbed. Several contributions
had to be translated into the C language before
submission; the software thus became available
on new classes of systems. Such "packaging" can be
a significant step in the life of a software system.

• Validation — The processes of translating, installing,
and  evaluating  contributed  software  have  often  led  to 
the  discovery  of  programming  bugs  and  occasionally 
bugs  in  the  algorithms.  Where  bugs  are  not  found, 
there  is  greater  assurance  that  such  bugs  do  not  exist. 

• Training — We, the evaluating personnel, had to
learn to use the software and to understand the theory
behind it, thus extending the knowledgeable user
community. We have documented this understanding
and are otherwise communicating it to others.

• Documentation — Submission of software for evaluation
has often spurred initial documentation of the
package. Any weaknesses in this documentation were
brought to light as we learned to use the package. We
have then filled in the gaps and have added any necessary
overview, literature survey, operating instructions,
performance examples, and suggestions for improvement.

We have also placed notes to future implementers
and users in the source code and in the on-line
man page documenting the Testbed version of the
contribution. (These man pages are included in
the Testbed programmer's manual [7].) We believe
that such channels of communication between users
scattered in space and time are essential for the
continued growth of the software.

• Augmentation — We generally had to modify the submitted
code to use local graphics and user interface
routines, to instrument the code with additional displays
or printouts of internal variables, and to rewrite
portions of the code to eliminate trivial restrictions
or to make the package more efficient for particular
tasks. The dividing line between evaluation and
new development is not clear, but it is clear that the
evaluation effort often leads to improvements in the
software. The Testbed environment also had to grow
to support the contributions, and many ideas from
the contributions have been adapted for use in other
software.

3.  Evaluation  Structure 

The  tasks  involved  in  evaluating  a  contribution  are 
reflected in the structure of the evaluation report. We have
developed  this  structure  for  recording  and  communicating 
the  results  of  our  investigations. 

The  introduction  to  a  report  summarizes  the  nature  of 
the  reviewed  software  package,  the  computer  languages  or 
system  facilities  needed  to  support  it,  and  the  contributions 
of  various  people  in  designing,  creating,  and  maintaining  it. 


The succeeding background section describes the package
from a management viewpoint. Generally this is one of the
last sections written because it requires knowledge gained
from the entire evaluation effort. First there is a general
description of the package, including its purpose, inputs,
processing steps, and outputs. Then typical applications
and usage scenarios are described, including preconditions
and the domain of applicability, relation to preprocessor
and postprocessor programs, applications that have been
documented in the literature, and potential applications
that we or other researchers have suggested.

The background section also describes potential extensions
and related applications. Potential extensions are applications
that might be feasible if the package were modified
or extended, used in a nonstandard fashion, or incorporated
as an element of a larger system. Related applications
are generalizations or variants of the standard applications
for which other techniques seem to be more appropriate.

A descriptive section then documents the algorithm in
detail. We begin with its historical development to introduce
vocabulary and to put the major technical issues into
perspective. Literature references are cited to give credit
where credit is due, to aid researchers in finding the full
range of concepts that have been explored, and to provide
managers and implementers with contacts for further inquiries.
The section closes with a detailed statement of the
algorithm, including further discussion of design options
and references to the literature as appropriate.

The next section is a brief implementer's guide describing
the structure of the contributed software and the Testbed
locations of its source files, executable files, on-line documentation,
and demonstration files. This is information
needed to install and run the package or to modify and
maintain it. We have included here a description of the
SRI modifications to the contribution.

A program documentation section then serves as a users'
guide to running the package and invoking all of the
algorithm features. We have given instructions for both
interactive and batch (or background) execution, including
documentation of all command-line invocation options,
interactive commands, controlling variables and flags, and
status variables. Sometimes we have also found it
necessary to give a detailed description of the program's
execution phases, complementing the theoretical description
of the algorithm in previous sections. This section of the
evaluation report could be omitted in cases where existing
documents provide adequate and unified documentation of
the program.

Our report can now document the evaluation proper.
We have divided the evaluation section into two parts:
effects of parameter settings and performance statistics for
representative tasks. A subjective summary may also be
included.

The purpose, intended effect, and legal values for each
parameter and control variable were specified in the last
section. In this section we probe more deeply, determining
the true effect of each parameter on system performance
and documenting interactions (either constraints or
synergistic effects) with other parameters. The end result is a
set of rules for setting the parameters in various
processing situations. We also comment on the usefulness of the
control features and give suggestions for improving them.

Next we document the performance of the system on
selected scene analysis tasks. (Selection of the tasks is
discussed later in this paper.) We describe the test protocols,
including the input images and the parameter settings that
we found optimal for the tasks. We present subjective and
objective performance measures and summarize the
apparent strengths and weaknesses of the algorithm and the
software implementation.

Subjective trials are difficult to document. We ran
hundreds of trials on dozens of images. Often a trial
designed to investigate one effect would turn up something
else as well. It is impractical to illustrate each of these
findings in the final report. (Many are of the form “Note
how the edge detector found this weak edge but missed
that much stronger one.”) We have therefore attempted
to summarize our findings and present only the relevant
information.

In the next report section, we suggest substantial
modifications to the algorithm or the implementation. Some
of these are of potential, but uncertain, immediate benefit
and some are extensions into task areas far beyond those
considered by the original author. We also mention known
improvements to the contributing institution's continuing
software development that have not been incorporated into
the more stable Testbed version. Many suggestions are
derived from the work of other researchers, in which case we
supply the appropriate references. Other suggestions arise
from our own evaluation effort.

Our evaluation report concludes with a summary of
the major technical concepts and of the strengths and
weaknesses of the contributed algorithm and software.
Appendices may give further information about the task
domain, the algorithm, or the software package that is too
detailed for the main body of the report but is not readily
available elsewhere.

4.  Evaluation  Methodology 

We have focused our evaluation efforts on the topics of
greatest utility. Issues of applicability and of parameter
effects and interactions have been given highest priority;
issues of resource utilization have been given lower priority,
because they are dependent on the algorithm
implementation and supporting hardware.

Some of the most difficult evaluation issues have to do
with the theory behind the algorithm. We have attempted
to summarize the theoretical basis of each contribution,
but evaluation of the theory is generally impractical. The
best we could do is to document other approaches to similar
tasks and to note strengths and weaknesses of the algorithm
as reported in the literature or found in our own work. For
this reason we have included an extensive literature survey
in each of our evaluation efforts.

Fair evaluation of a contribution required that we choose
particular tasks for it to perform. Some delicacy was
needed in making these choices. It would be unfair
to evaluate the software on tasks for which the author
considered it unsuited. It would also be pointless to use
only tasks that had been well documented in the literature;
the essence of evaluation is the learning of something new.
We have tried to choose tasks that are well within the
contribution's domain and yet of fundamental interest to
automated cartography and scene analysis.

The GHOUGH object detection system came with
instructions for finding a distinctive lake in an aerial scene.
We chose the finding of circular objects in aerial scenes and
right-angled corners in oblique scenes as additional tasks.
The PHOENIX segmentation system came with a test case
of segmenting an orange chair from a white background.
We chose skyline analysis as a realistic task. The RELAX
probabilistic relaxation system was set up for noise
cleaning of an infrared image of a tank. We chose gradient edge
detection and segmentation of vehicles from roads as
additional tasks. In each case, the imagery was rich enough
that performance on auxiliary problems (e.g., nonpurposive
segmentation) could be subjectively evaluated.

For each task, we selected suitable imagery, ran dozens
of trials to establish optimal parameter settings, and
documented the results. If little documentation and
operating information came with the package, we spent much of
our time learning and recording this information. If
documentation was adequate and few parameters had to be
explored, we were able to spend more time recording
operating characteristics and performance statistics. Economic
constraints limited the depth to which any task could be
evaluated, but we were able to provide an adequate
foundation for future researchers with specific problems.

The first step in evaluating any package was to get
it working on a simple test image, usually an image
provided by the author. This process occasionally took
considerable effort; we chose to rewrite the entire I/O
structure and user interface for one program, for instance.
This integration effort was essential to the development of
the Testbed but did raise a thorny issue: to what extent
should we fix perceived deficiencies and to what extent
should we simply document them? One rule of thumb was
that we would fix or extend the code in any manner required
to carry out an evaluation on realistic tasks.

The next step was to test the software rigorously on one
or more simple images. The idea was to become familiar
with the workings of the program and with the options
available to the user. This step also helped identify software
bugs or misunderstandings about the intended functions of
the program. We strongly recommend the use of generated
or well understood problems as one phase of the evaluation
effort.

Investigation of parameter interactions was one of the
most difficult evaluation tasks. Analysis of simple imagery
permitted us to concentrate on internal variables instead
of interactions with a complex environment. Even so, the
“search space” of possible interactions was immense. The
PHOENIX program, for example, has 14 major threshold
values to control image segmentation, in addition to various
control strategies and interaction options.

One could navigate this complexity by using an intelligent
driver system to monitor thousands of runs, modifying
parameter settings each time to optimize some performance
criterion. While such an approach is feasible [8], it would
have provided only a superficial understanding of why the
identified parameter sets were optimal combinations. We
chose instead to analyze the program structure, experiment
with carefully chosen parameter values, and study the
execution (as opposed to a single result) of each computer
run. Often we had to disable features of the program in
order to study one feature in isolation.

The final experimental step was to evaluate the software
for realistic tasks on “natural” imagery. This proved to
be exceedingly difficult because the space of input imagery
was impossibly large. If a program could detect circular
tanks in one image, for instance, would it be able to detect
them at different image resolutions? With different levels of
contrast and blur? With strong shadows and highlighting
present? With occlusions, unusual edge alignments, or
texture effects present? Would it be able to distinguish real
targets (possibly camouflaged) from decoys and destroyed
targets?

Such questions go beyond the scope of this initial
evaluation effort. We tried our best, however, to get a “feel” for
each program's capabilities. We varied pertinent imagery
variables and carefully noted the effects. Anomalies were
checked out by instrumenting the code or by
experimentation on simple images. We believe that this intelligent
experimentation is at least as useful as extensive statistical
validation would be.


Unfortunately we were not able to devise rigorous
performance metrics for tasks such as “target detection.” We
carefully tuned the analysis system for each problem and
reported the best performance that could be obtained. (We
tried to avoid tuning the system for each image, however.
A single parameter set or operating procedure was
developed for each task.) The results are necessarily subjective
and would vary slightly for other tasks, other imagery, or
other experimenters.


5. Evaluation Examples

It is difficult to convey the scope and variety of our
evaluation results in a short paper. The PHOENIX
report alone is more than 80 pages long, with 25 pages
devoted to performance evaluation and suggestions for
future development. I will illustrate the evaluation results
by presenting short excerpts from the reports. Different
reports will be used to illustrate different points about the
nature of the evaluation effort.


5.1. Theory

We have documented the theoretical basis and the
implementation of each contribution from several perspectives.
We have provided theoretical justifications and
mathematical notations where appropriate, and have then related this
information to the parameters and commands of the
software packages. Sometimes it was quite difficult to extract
this information from the technical literature.

The RELAX system, for instance, could be regarded as
a general method for local modification of constraint and
compatibility information stored in the nodes of a
rectilinear graph. The initial label probabilities at the nodes may
be derived from image pixel intensities and the final label
assignments may be mapped back to pixel intensities, but
the iterative relaxation technique is independent of
image-domain considerations. Much of the theoretical work on
relaxation has abandoned the rectilinear image plane and
has dealt with constraint relations on arbitrary graphs with
varying numbers of neighbors for each node.

For the evaluation, we extracted the updating equations
actually used in the software package and expressed them
in terms common in the theoretical literature. The RELAX
package includes both the Hummel-Zucker-Rosenfeld
additive updating scheme and the Peleg multiplicative updating
scheme. Here is part of our description of the former:

The goal of the relaxation algorithm is to update the
values of the probabilities associated with a node to
reflect the compatibility of neighboring labels. The
(t+1) update of the kth label value is calculated from
the previous time (t) update by

\[
p_i^{(t+1)}(\lambda_k) =
  \frac{p_i^{(t)}(\lambda_k)\,\bigl[1 + q_i^{(t)}(\lambda_k)\bigr]}
       {\sum_{l=1}^{n} p_i^{(t)}(\lambda_l)\,\bigl[1 + q_i^{(t)}(\lambda_l)\bigr]}
\]

\[
q_i^{(t)}(\lambda_k) =
  \frac{1}{m} \sum_{j=1}^{m} \sum_{l=1}^{n}
  r_{ij}(\lambda_k, \lambda_l)\, p_j^{(t)}(\lambda_l)
\]

where $j$ indexes the $m$ neighbors of node $i$.
$r_{ij}(\lambda_k, \lambda_l)$ is the compatibility coefficient
for node $i$ with label $\lambda_k$ and neighboring node $j$
having label $\lambda_l$. $q_i(\lambda_k)$ can be thought of as
the assessment by the neighboring nodes that node $i$ should
be labeled $\lambda_k$, while $p_i(\lambda_k)$ is the
assessment by node $i$ as to its own label. These two
assessments are combined to produce an updated
probability, $p_i(\lambda_k)$.
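
As an illustration only, one iteration of the additive update described above can be sketched in Python. The function name `hzr_update`, the list-based node and neighbor layout, the callable `r(i, j, k, l)`, and the toy compatibility function are assumptions made for this sketch; none of them is taken from the RELAX package. Neighbor support is averaged over the m neighbors of each node.

```python
def hzr_update(p, neighbors, r):
    """Apply one additive relaxation step to every node.

    p[i][k]       -- current probability of label k at node i
    neighbors[i]  -- indices of the neighbors of node i
    r(i, j, k, l) -- compatibility in [-1, 1] of label k at node i
                     with label l at neighboring node j
    """
    n_labels = len(p[0])
    updated = []
    for i, p_i in enumerate(p):
        m = len(neighbors[i])
        # q[k]: average support from the neighbors for label k at node i.
        q = []
        for k in range(n_labels):
            support = sum(r(i, j, k, l) * p[j][l]
                          for j in neighbors[i]
                          for l in range(n_labels))
            q.append(support / m if m else 0.0)
        # Combine the node's own assessment with the neighbors' assessment,
        # then renormalize so the label probabilities sum to one.
        weighted = [p_i[k] * (1.0 + q[k]) for k in range(n_labels)]
        total = sum(weighted)
        updated.append([w / total for w in weighted] if total > 0 else list(p_i))
    return updated


def same_label_bonus(i, j, k, l):
    """Toy compatibility (hypothetical): neighbors prefer to share a label."""
    return 1.0 if k == l else -1.0


probs = [[0.6, 0.4], [0.9, 0.1]]   # two nodes, two labels
links = [[1], [0]]                 # each node is the other's only neighbor
probs = hzr_update(probs, links, same_label_bonus)
print(probs[0])                    # label 0 is reinforced at node 0
```

Because the compatibilities lie in [-1, 1] and each neighbor's probabilities sum to one, the factor 1 + q is never negative, so the renormalized values remain valid probabilities.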

The compatibility coefficients may be negative if the
labels are incompatible, positive if the labels are
compatible, and zero if they are independent. While
it is possible to define the compatibility coefficients
in terms of conditional probabilities, it is overly
restrictive to do so. The compatibility coefficients
for the Hummel-Zucker-Rosenfeld rule are based on
information theory; mutual information defines the
compatibility coefficients and provides a mechanism
for calculating them:

\[
r_j(\lambda_k, \lambda_l) =
  \frac{1}{5} \ln
  \frac{N \sum_i p_i(\lambda_k)\, p_{ij}(\lambda_l)}
       {\sum_i p_i(\lambda_k) \cdot \sum_i p_i(\lambda_l)}
\]

where $k$ and $l$ range over the $n$ labels, $i$ ranges over all
$N$ nodes of the graph, and $j$ specifies the particular
neighbor (e.g., upper-left) of the $i$-th node. For each
node $i$, $r_{ij}(\lambda_k, \lambda_l)$ is equal to
$r_j(\lambda_k, \lambda_l)$ clipped to be in
the closed interval [-1, 1].
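
The mutual-information estimate can be sketched in the same spirit. This is a minimal illustration, assuming that every node has a neighbor in the chosen direction (for example, a wrap-around grid or ring) and that a label pair that never co-occurs is assigned the lower clipping bound. The function name `mutual_information_compat` and the `neighbor_of` list are hypothetical and do not come from the RELAX package.

```python
import math


def mutual_information_compat(p, neighbor_of, scale=5.0):
    """Estimate r_j(k, l) for one neighbor direction j.

    p[i][k]        -- probability of label k at node i
    neighbor_of[i] -- index of the j-th neighbor of node i
    Returns a matrix r with r[k][l] clipped to [-1, 1].
    """
    n_nodes = len(p)
    n_labels = len(p[0])
    # Total probability mass of each label over all nodes.
    totals = [sum(p[i][k] for i in range(n_nodes)) for k in range(n_labels)]
    r = []
    for k in range(n_labels):
        row = []
        for l in range(n_labels):
            joint = sum(p[i][k] * p[neighbor_of[i]][l] for i in range(n_nodes))
            if joint <= 0.0:
                # Assumption: a pair that never co-occurs is treated as
                # maximally incompatible (the logarithm tends to -infinity).
                row.append(-1.0)
                continue
            value = math.log(n_nodes * joint / (totals[k] * totals[l])) / scale
            row.append(max(-1.0, min(1.0, value)))
        r.append(row)
    return r


# Ring of four nodes with alternating crisp labels; neighbor = next node.
ring = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(mutual_information_compat(ring, [1, 2, 3, 0]))
```

In this toy ring, unlike labels always sit next to each other, so the off-diagonal coefficients come out positive and the diagonal ones are clipped to -1.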

In  the  report  we  have  discussed  the  meaning  of  these 
terms  and  methods  of  estimating  or  setting  the  numeric 
values,  as  well  as  the  effects  of  relaxation  in  various  image 
analysis  tasks. 


5.2.  Analyses 

The following are a few results from the performance
analyses on the PHOENIX and GHOUGH systems. Space
limitations prevent a full description of all of the terms
used, but the examples should give some feeling for the level
of understanding that a thorough evaluation may require.

The PHOENIX segmenter is a moderately complex
system with 14 user-settable variables that control the
segmentation process itself. The original contribution came
with very little guidance about setting these parameters to
achieve reasonable segmentations. One of our evaluation
tasks was the creation of such information.

We began by finding a set of parameter values that would
segment simple scenes of large objects down to the level of
major subregions. We called this a “moderate”
segmentation. Then we developed a set of parameter values for
very coarse segmentation using “strict” heuristics to
disallow most potential region splits. Finally we developed a set
of values for complete or overly permissive segmentation
using “mild” threshold screening heuristics.

Each column in Table 1 lists one of these parameter sets.
The user need only select the extent of segmentation desired
and load the corresponding parameters. We thus reduced
the 14 parameters to a manageable single decision that is
relatively independent of the image content. Additional
flexibility is possible by switching parameter sets during a
segmentation run to control the fineness of segmentation
within particular image regions.

It  will  occasionally  be  necessary  for  the  user  to  deviate 
from  the  recommended  parameter  settings.  To  make  this 
possible,  we  have  evaluated  each  parameter  individually. 
Here  is  part  of  the  maxmin  parameter  description: 

Maxmin is the minimum acceptable ratio of apex
height to higher shoulder for an interval in the
histogram. Any interval failing this test is merged
with the neighbor on the side of the higher shoulder.
The test is then repeated on the combined interval.

The overall effect on a set of cutpoints is to eliminate
those that are on the sides or tops of major peaks.

Maxmin is a powerful heuristic. With strict
smoothing and all other heuristics disabled, maxmin alone is
able to produce reasonable segmentations. It is even
more powerful when combined with the area
heuristics. With mild or moderate smoothing, maxmin
passes clusters of cutpoints in the noise regions
between major peaks. This is fine if the clusters can be
thinned by the absarea and relarea heuristics, but
a poor selection may be made if they are left for the
intsmax heuristic.

The problem here is that PHOENIX has no “quality”
score for histogram valleys. It assumes that cutpoint
bin height is an adequate measure, whereas width
and depth relative to the neighboring peaks are
also important. PHOENIX can only incorporate
such knowledge by smoothing the histogram, and
the amount of smoothing required depends on how
separated the peaks are.


Table 1

Parameter   Strict    Mod   Mild
depth            4     10     20
splitmin       200     40     20
hsmooth         25      9      5
maxmin         300    160    130
absarea        100     30      5
relarea         10      2      1
height          50     20     10
absmin           2     10     31
intsmax          2      3      f
isetsmax         2      3      5
absscore      92 b    700    600
relscore        95     80     65
noise           50     10      5
retain           4     20     40
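
The maxmin test quoted above can be sketched as follows. This is a simplified reading rather than the PHOENIX code: it assumes that maxmin is expressed in percent (as the values 300, 160, and 130 suggest), that an interval's shoulders are the histogram heights at its bounding cutpoints, and that the two ends of the histogram do not count as shoulders. The function name `apply_maxmin` is hypothetical.

```python
def apply_maxmin(hist, cuts, maxmin_pct):
    """Return the cutpoints that survive the maxmin test.

    hist       -- list of histogram bin counts
    cuts       -- bin indices of candidate cutpoints (valleys)
    maxmin_pct -- minimum apex-to-higher-shoulder ratio, in percent
    """
    cuts = sorted(cuts)
    i = 0                          # index of the interval under test
    while i <= len(cuts):          # len(cuts) + 1 intervals remain
        lo = cuts[i - 1] if i > 0 else 0
        hi = cuts[i] if i < len(cuts) else len(hist) - 1
        apex = max(hist[lo:hi + 1])
        # Shoulders are taken only at real cutpoints, not at histogram ends.
        shoulders = []
        if i > 0:
            shoulders.append((hist[lo], i - 1))
        if i < len(cuts):
            shoulders.append((hist[hi], i))
        if not shoulders:
            break                  # a single interval spans the histogram
        # Higher shoulder; a tie is resolved toward the right-hand cutpoint.
        height, cut_index = max(shoulders)
        if apex * 100 < maxmin_pct * height:
            # Merge with the neighbor on the higher-shoulder side by
            # deleting the cutpoint between them, then retest the result.
            del cuts[cut_index]
            if cut_index == i - 1:
                i -= 1
        else:
            i += 1
    return cuts


hist = [0, 10, 2, 3, 2, 12, 0]
print(apply_maxmin(hist, [2, 4], 160))   # the weak middle interval is merged
print(apply_maxmin(hist, [2, 4], 130))   # the milder setting keeps both cuts
```

Comparing `apex * 100` with `maxmin_pct * height` avoids a division, so a shoulder of height zero always passes the test.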

The next step in the PHOENIX evaluation was
investigation of a skyline delineation task. One of the test images,
Portland, shows a city skyline against a cloudy sky. After
describing segmentation performance on reduced versions
of this and other images, we reported the following:

A test sequence was run on the full-resolution
(500x500) Portland image. Strict and even
moderate heuristics were unable to segment the image when
only the red, green, and blue feature planes were used;
it was necessary to use the mild heuristics. The best
approach would be to start the segmentation with
mild thresholds and then return to strict or
moderate ones for segmenting the subregions. Instead, we
avoided such special interference and ran the
segmentation to completion using mild heuristics. The full
run (which, with the V flag set, generated 10,000
lines of printout) required 33 minutes of CPU time:

PHASE           REAL      CPU
Histogram       0:04:13   0:02:32
Interval        0:18:12   0:07:27
Threshold       0:10:00   0:03:47
Patch           0:03:51   0:03:30
Collect         0:38:12   0:14:01
Segmentation    1:18:15   0:32:34

The final segmentation into JJ82 regions (including
nearly every window of every building) was much
better than the original attempt, but still had difficulties
distinguishing a glass-surfaced building from the sky
that it reflected.

Later test runs showed even better performance when
color transforms were used in addition to the three original
color planes. Based on our experience with these tests, we
were able to suggest operating procedures for the use of
PHOENIX in similar tasks.

The evaluation of the GHOUGH object detection system
was similar. Because GHOUGH had fewer parameters, we
were able to spend more time analyzing system performance
on realistic tasks. One thrust of this effort was to develop
an understanding of specific operational characteristics, as
in the paragraphs below.

The requirement of sharp edges does not imply that
smooth, continuous object boundaries are required.
The program is quite tolerant of noise in the outline
and is able to find irregular, incomplete, or
discontinuous shapes. The circle template, for instance, often
responds to forest clearings, tree tops, road
intersections, and curved embankments, as well as to square
buildings and to image “hot spots.” The
irregularities in these image structures spread the vote cluster
in the accumulator, but the local maximum may still
be above the general noise level.

Shadow edges usually fit the requirement for strong,
sharp edges. It is often easier to find a shadow
than to find the object that cast it. This may be
a useful cueing technique, but must be used carefully
to avoid reporting objects at incorrect locations. A
similar problem exists with high-resolution imagery:
the position reported for a part of an object (e.g., the
circular top of a storage tank) may not correspond to
the position of the whole object.

These characteristics mean that the program is best
suited for three tasks: locating industrial parts in
high-contrast imagery; counting numerous, obvious,
similar objects such as storage tanks, barracks, or
microscopic particles; and precisely positioning a
template when an approximate location is cued by
the user or by another system. Even for these
applications, the program must be supervised and its
output edited. Other applications will require further
development of the technique.
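
The accumulator voting behind the circle template can be illustrated with a much simplified sketch. It is not the GHOUGH implementation: it assumes a circle of known radius, ignores edge direction, and lets each edge point vote once for every accumulator cell lying on a ring of that radius around it. The names `vote_for_centers` and `n_angles` are hypothetical.

```python
import math
from collections import Counter


def vote_for_centers(edge_points, radius, n_angles=72):
    """Accumulate votes for circle centers at a fixed radius.

    edge_points -- iterable of (x, y) edge pixel coordinates
    radius      -- radius of the circle template, in pixels
    n_angles    -- number of directions sampled around each edge point
    """
    accumulator = Counter()
    for x, y in edge_points:
        cells = set()
        for a in range(n_angles):
            t = 2.0 * math.pi * a / n_angles
            cells.add((round(x - radius * math.cos(t)),
                       round(y - radius * math.sin(t))))
        # Each edge point votes at most once per accumulator cell.
        for cell in cells:
            accumulator[cell] += 1
    return accumulator


# Twelve integer points lying exactly on a circle of radius 5 about (20, 15).
offsets = [(5, 0), (-5, 0), (0, 5), (0, -5),
           (3, 4), (3, -4), (-3, 4), (-3, -4),
           (4, 3), (4, -3), (-4, 3), (-4, -3)]
edges = [(20 + dx, 15 + dy) for dx, dy in offsets]
votes = vote_for_centers(edges, 5)
print(max(votes, key=votes.get), votes[(20, 15)])
```

Irregular or incomplete outlines simply contribute fewer votes to the true center and scatter the rest over nearby cells, which is the spreading of the vote cluster described in the excerpt.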

Sometimes  our  results  were  quite  unexpected,  as  when  we 
found  that  increasing  the  number  of  points  in  the  search 
template  definition  had  no  effect  on  execution  time  and 
could  actually  decrease  target  location  accuracy.  Execution 
time  was  unaffected  because  each  edge  in  the  image  votes 
for  only  the  best  matching  template  edge  (or  set  of  edges), 
regardless  of  the  number  of  similar  or  nearby  edges  in 
the  template.  Performance  could  be  degraded  because 
the  template  points  were  entered  at  discrete  points  on  a 
Cartesian  grid,  and  close  spacing  of  the  points  caused  severe 
quantization  of  the  relationships  between  them. 


We were able to quantify system performance on
representative tasks. Some of the domain-independent equations
are given below. (The terms are fully explained in the
evaluation report.) We have attempted to base the formulas
on important characteristics of the GHOUGH algorithm,
although the coefficients had to be estimated empirically.

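\[
\begin{aligned}
\text{Edge time} &= .00036\,(\text{window points}) + .0053\,(\text{edges found}) \\
&\quad + .00019\,(\text{accumulator entries}) + (\text{additional paging time}) \\
\text{Analysis time} &= 10^{-?}\,(\text{search volume})
  \times \bigl(.08 + 2.0 \log(1 + \text{accumulator density})\bigr) \\
&\quad + (\text{additional paging time}) + .0025\,(\text{maxima found}) \\
\text{Maxima} &= 0.23\,(\text{search volume})^{0.?}\,(1 + \text{accumulator density})^{?} - 1 \\
\text{Noise} &= 2.0\,(\text{search volume})^{0.0?}\,(1 + \text{accumulator density})^{?} - 1
\end{aligned}
\]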


Such formulas would be very helpful in designing an
improved version of GHOUGH. Even more exciting is the
possibility of building an expert image analysis system that
would include GHOUGH as a component. The knowledge
base of such a system would record predictive formulas and
other operating characteristics in a form that could be used
by both humans and machines.

Some of the GHOUGH parameters are dependent on
image content. These were very difficult to quantify, but
we attempted to document the dependencies well enough
that users could adapt our findings to their own imagery.
The following is our discussion of GHOUGH performance
as a function of the edge-detection threshold:

The number and density of edges detected in an
image are sigmoid (s-shaped) functions of edge
threshold similar to cumulative frequency histograms.
GHOUGH operates best when 10% to 20% of the
pixels are classified as edge points, although it will
usually work well at any edge density above b%. Some
typical threshold values to achieve specified edge
densities are:

Scene Type            6%    12%    25%    50%
Cloudy sky            42     35     28     20
Aerial terrain       160    120     80     40
Aerial target area   200    180    120     60
Low-angle urban      260    200    140     00
Forest cover         ??0    220    160    100
Aerial urban         720    600    480    340

In general it is better to use too low a threshold: this
will increase chances of finding target edges while
only slightly increasing noise level, and the edges
found are likely to be the most reliable ones. The
main drawback is that low thresholds increase the
time required to fill the accumulator with votes. A
reasonable starting guess is a threshold of 120.

As we experimented with the software packages, we
noted a great many characteristics that could be improved.
The preceding GHOUGH edge-threshold sensitivity, for
instance, led us to suggest that an adaptive edge detector
be used. Our suggestions have covered everything from
the algorithm to the characteristics of larger systems that
might incorporate these routines.
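
Because the threshold needed for a given edge density varies so widely with scene type, one simple form of adaptation is to derive the threshold from the image's own gradient magnitudes. The sketch below illustrates that idea under stated assumptions: gradient magnitudes are supplied as a flat list, a pixel counts as an edge point when its magnitude is at least the threshold, and the function names are hypothetical. It is not the edge detector used by GHOUGH.

```python
def threshold_for_density(magnitudes, target_density):
    """Pick a threshold so that about target_density of pixels are edges.

    magnitudes     -- gradient magnitude of every pixel, as a flat sequence
    target_density -- desired fraction of edge pixels, in (0, 1]
    """
    if not 0.0 < target_density <= 1.0:
        raise ValueError("target_density must be in (0, 1]")
    ranked = sorted(magnitudes, reverse=True)
    # Keep the k strongest pixels; the weakest of them sets the threshold.
    k = max(1, round(target_density * len(ranked)))
    return ranked[k - 1]


def edge_density(magnitudes, threshold):
    """Fraction of pixels whose magnitude reaches the threshold."""
    values = list(magnitudes)
    return sum(1 for g in values if g >= threshold) / len(values)


gradients = list(range(1, 101))            # toy data: magnitudes 1..100
t = threshold_for_density(gradients, 0.12)
print(t, edge_density(gradients, t))
```

When many pixels share the threshold magnitude, the achieved density can exceed the target, so the result is best checked with `edge_density` before use.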

5.3.  Summaries 

We have also tried to summarize our findings, drawing on
our experience with other image analysis systems. Each of
the reports ends with such a summary. For the PHOENIX
system, our observations included the following:

The PHOENIX segmentation system is one of several
existing systems for recursively segmenting digital
images. Its major contributions are the optional use
of multiple thresholds, spatial analysis for choosing
between good features, and a sophisticated control
interface. Some of the strengths and weaknesses of
the PHOENIX algorithm are listed below.

PHOENIX, like other region-based methods, always
yields closed region boundaries. This is not true
of edge-based feature extraction methods, with the
possible exception of boundary following and
zero-crossing detection. Closed boundaries are the essence
of segmentation and greatly simplify certain
classification and mensuration tasks.

PHOENIX is a hierarchical or recursive segmenter,
which means that even a partial segmentation may
be useful. This can save a great deal of computation
if efforts are concentrated on those regions where
further segmentation is critical. If PHOENIX is to be
driven to its limits, other methods of segmenting to
small, homogeneous regions may be more economical.

PHOENIX is relatively insensitive to noise.
Thresholds are determined by the feature histograms, where
noise tends to average out. This contrasts with
edge-based methods, where the local image
characteristics can be highly perturbed by noise.

PHOENIX has no notion of boundary straightness or
smoothness. This may be good or bad depending on
the scene characteristics and the analysis task. It
easily extracts large homogeneous regions that may
be adjacent to detailed, irregular regions (e.g., lakes
adjacent to dock areas or sky above a city); such tasks
can be difficult for edge-based segmenters.

PHOENIX tends to miss small regions within large
ones because they contribute so little to the composite
histogram. It is thus poorly suited for detecting
vehicles and small buildings in aerial scenes, although
there may be ways to adapt it to this use. It also tends
to misplace the boundary between a large region and a
small one, thus obscuring roads, rivers, and other thin
regions. Boundaries found by edge-based methods are
less affected by distant scene properties.


PHOENIX may also fail to detect even long and
highly-visible boundaries between two similar regions
if the region textures cause their histograms to
overlap. Edge-based methods are better able to detect
local variations at the boundary.

Since perfect segmentation is undefined, PHOENIX
must oversegment an image in order to find all region
boundaries that may be of use to any higher-level
process. It is left for a segmentation editing step
to merge segments that have no usefulness for some
particular purpose. Without having such a step, or
indeed even a purpose, it is very difficult to evaluate
the segmenter output.

6.  Conclusions 

Our evaluation efforts have documented a great many
suggestions for improving the evaluated software. We
have tried to be as quantitative and rigorous as possible,
but the results are necessarily subjective. Often we have
functioned as restaurant or theater critics do, reporting
our impressions of the contributions. These informed
opinions, combined with our more rigorous documentation,
should provide a good basis for more specific evaluation
efforts directed at particular task scenarios and production
environments. Our evaluation reports and the SRI Testbed
environment make the contributed programs available as
benchmark systems and as research tools.


ABSTRACT

This  paper  reviews  the  current  status  of  an 
ongoing  program  to  demonstrate  the  application  of 
DARPA  Image  Understanding  research  to  a  photo 
interpretation  system  using  real  imagery.  The 
vision  system  developed  thus  far  is  based  on  the 
Acronym  vision  system,  developed  by  Rod  Brooks  and 
Tom  Binford  at  Stanford  University  on  the  DARPA 
Image  Understanding  Project.  This  system  has 
provided  the  basis  for  a  sophisticated  vision 
capability,  implemented  on  a  general  purpose 
computer.  In  particular,  the  system  includes 
symbolic prediction of shapes from both oblique
and vertical camera views, prediction of shadow
shapes, a model-driven shape extraction system,
an area of interest operator, and a situation
assessment module to perform temporal analysis.
The first system implementation has recently been
completed and exercised, so this discussion
includes both details of system design and issues
related to performance and future extension.


Hughes,  with  support  from  DARPA  and  ONR,  is 
conducting  a  program  to  apply  research  results 
from  the  DARPA  Image  Understanding  Project  [7]. 
This program has combined several existing image
understanding  components,  along  with  new 
approaches,  to  address  problems  associated  with 
automated  photo  interpretation. 

The  system  which  has  been  developed  attempts 
to  identify  interesting  objects  by  matching  shapes 
extracted  from  digitized  images  to  shapes 
generated  by  geometric  analysis  of 
three-dimensional  object  models  and  information 
describing the camera and illumination conditions.
Low-level processing provides identification of
likely areas of interest in each scene, and
edge-based  lines  extracted  from  these  regions  are 
provided  as  input  for  shape  extraction. 
Descriptions  of  predicted  shapes,  representing 
both  visible  portions  of  object  sub-parts  and 
shadows  cast  from  these  sub-parts,  are  also 
provided  to  direct  the  shape  extraction  process. 
Extracted  and  predicted  shapes  are  finally 
compared,  and  matching  shape  dimensions  and 
spatial  relationships  are  used  to  determine 
instances of both generic objects and specific
modeled sub-classes. In addition, object
identifications through a sequence of images may
be interpreted by a script-based situation
assessment module which is capable of improving
object identifications and making inferences about
the actions of objects.

This  paper  will  describe  the  current  vision 
system  and  situation  assessment  implementations 
and  summarize  areas  where  additional  extension  is 
necessary.  In  Section  2,  specific  system  modules 
will  be  described,  both  in  terms  of  approach  and 
ultimate performance when applied to typical
application  imagery.  Plans  for  extension  of  this 
system  in  an  anticipated  second  contract  phase 
will  be  discussed  in  Section  3. 

2  -  SYSTEM  COMPONENTS 

The vision system structure for the image
understanding system has been provided by the
Acronym system [3-6], developed by Rod Brooks and
Tom Binford at Stanford University on the DARPA
Image Understanding Project. This system provided
a powerful set of building blocks, including a
rule system, a slot/filler style record package,
and  a  constraint  manipulation  system  supported  by 
an  algebraic  simplifier.  Higher  level  Acronym 
modules  provided  vision  system  components  which, 
although  requiring  substantial  extension  by  both 
Hughes  staff  members  and  by  Dr.  Brooks  to  support 
a  more  general  image  understanding  capability, 
formed  a  powerful  working  vision  system  structure. 

Beyond the capabilities provided by the Acronym
system, further components were added to provide
contrast enhancement, area of interest
identification, image scaling, line extraction,
and temporal situation assessment.

The  remainder  of  this  section  will  describe 
each  portion  of  the  image  understanding  vision 
system  in  more  detail,  including  comments  about 
the  actual  performance  of  each  module  when  applied 
to  typical  application  imagery.  The  entire  system 
has  been  implemented  on  a  VAX  11/780  general 
purpose  computer,  utilizing  the  VMS  operating 
system,  Eunice  (a  Unix  emulation  package),  and  the 
Franz Lisp implementation. Selected low level
processing steps were implemented in the Fortran
or C programming languages. Mean filter
operations  were  performed  utilizing  a  DeAnza  IP 
8500  Video  Display  Processing  unit. 
2.1  -  PRE-PROCESSING

Image digitization techniques may result in
distorted object dimensions caused by oversampling
or undersampling of the original photograph during
the digitization process. Since dimensional
information is of key importance in the operation
of the vision system, utilities have been provided
to rescale the digital data using bilinear
interpolation techniques. Interpolation may
result in significant edge degradation, manifested
as broken representations of otherwise dominant
and continuous edges. This effect is minimized by
the application of a mean filter, typically using
three by three or five by five masks, depending on
the range of interpolation.
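
The rescaling and smoothing steps described in this section can be sketched as follows. This is only an illustrative sketch, not the original Fortran/C utilities: the function names, the corner-aligned sample mapping, and the border handling of the mean filter are all assumptions made for the example.

```python
def bilinear_rescale(img, new_h, new_w):
    """Resample a 2-D list of grey levels to new_h x new_w pixels.

    Assumption: output corners map onto input corners (corner-aligned).
    """
    h, w = len(img), len(img[0])
    out = []
    for r in range(new_h):
        # Fractional source row for this output row.
        y = r * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for c in range(new_w):
            x = c * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            # Interpolate along x on the two bracketing rows, then along y.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out


def mean_filter(img, size=3):
    """size x size mean filter (size = 3 or 5 in the text).

    Assumption: border pixels average over the in-bounds neighbours only.
    """
    h, w = len(img), len(img[0])
    k = size // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [img[rr][cc]
                    for rr in range(max(0, r - k), min(h, r + k + 1))
                    for cc in range(max(0, c - k), min(w, c + k + 1))]
            out[r][c] = sum(vals) / len(vals)
    return out
```

A larger mask would presumably be chosen when the interpolation factor is larger, in line with the remark that the mask size depends on the range of interpolation.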

2.2  -  AREA  OF  INTEREST  OPERATOR 

Determination of interest areas in a scene is
approached by determining whether local sixteen by
sixteen pixel regions in an image represent a
portion of an object or background. Masks, sized
appropriately for the objects of interest in a
given application and for a limited range of image
resolution, are applied across the image at
fifteen degree incremental rotations. Each area
thus measured is "scored" according to the number
of object points contained within the area. Areas
with a sufficiently high score are then deemed
"interesting" and later serve to restrict the
search area for the remainder of the vision
system.

In order to identify object points, the image
data is first contrast enhanced using a three part
piecewise linear lookup table, where the points of
discontinuity are determined manually by
heuristics driven by the image histogram.
Successive median filters and an optional Nagao
smoothing operator are applied to significantly
smooth intensity edges. This enhanced image is
then processed to determine local mean and
standard deviation values within sixteen by
sixteen non-overlapping blocks. The determination
of object vs. background for each block is based
on the derivative of these measures, using a
constant decision surface determined by training
over a range of test imagery related to the
application.
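
A minimal sketch of two of these steps, the three part lookup table and the per-block statistics, is given below. The breakpoint values, the function names, and in particular the final thresholding rule are assumptions for illustration; the trained decision surface actually used is not specified in this paper.

```python
import math


def three_part_lut(p1, p2, levels=256):
    """Three part piecewise linear lookup table.

    p1 and p2 are (input, output) breakpoints, assumed to satisfy
    0 < p1[0] < p2[0] < levels - 1. In the text they are chosen manually.
    """
    top = levels - 1
    knots = [(0, 0), p1, p2, (top, top)]
    lut = []
    for v in range(levels):
        for (xa, ya), (xb, yb) in zip(knots, knots[1:]):
            if xa <= v <= xb:
                lut.append(round(ya + (yb - ya) * (v - xa) / (xb - xa)))
                break
    return lut


def block_stats(img, block=16):
    """Mean and standard deviation over non-overlapping block x block tiles."""
    h, w = len(img), len(img[0])
    stats = {}
    for br in range(h // block):
        for bc in range(w // block):
            vals = [img[r][c]
                    for r in range(br * block, (br + 1) * block)
                    for c in range(bc * block, (bc + 1) * block)]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            stats[(br, bc)] = (mean, math.sqrt(var))
    return stats


def object_blocks(stats, std_threshold):
    """Hypothetical stand-in for the trained decision surface:
    a block is called 'object' if its standard deviation exceeds a constant."""
    return {key for key, (_, std) in stats.items() if std > std_threshold}
```

Applying the table is a per-pixel lookup, `lut[pixel]`, carried out before the smoothing filters and block statistics.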

Several comments can be made regarding the
performance of this area of interest operator.
First, the technique seems well suited to the
current application, where background intensities
tend to be fairly homogeneous and uncluttered.
The algorithm was able to consistently identify
object areas in imagery and reject clutter.
However, it is believed that applications where
the background is likely to be highly cluttered
may require a more complex set of discriminant
values and a more carefully trained decision
surface.

2.3  -  LINE  EXTRACTION 

The extraction of lines from imagery is of
key importance, since these lines provide the sole
information from which observed shape descriptions
are determined in this system. In the current
implementation, the Nevatia/Babu linefinding
algorithm has been implemented in the C language,
with some modifications to the bridging and
linking algorithms. This linefinding system has
proven a valuable means of extracting most
necessary edges separating object subparts from
background. However, two specific problem areas
have been identified which will ultimately require
a more specialized approach to edge
identification.

The first range of segmentation problems
involves edges which are easily discerned by human
observers, but which fail to be extracted by the
Nevatia/Babu algorithm. Typically, these edges
are not characterized by a significant intensity
gradient, but rather by the border between regions
of significantly different texture. A second set of
problems (often overlapping with the texture
border problem) is related to the extraction of
very soft shadow boundaries. In these cases, the
intensity gradient across the boundary, although
consistent along the entire shadow edge, is of
such a small magnitude as to be inseparable from
background clutter. Reduction of edge thresholds
to identify these boundaries would result in an
unmanageable number of insignificant edges being
identified.

2.4  -  OBJECT  MODELING 

Interesting objects are modeled by the
Acronym modeling system as three dimensional
models built from generalized cylinder volumetric
elements [1]. These volume primitives are
represented in general by a cross sectional face
which describes the volume as it is swept along an
axial spine, with the dimensions of the face being
varied along the axial sweep. In practice, limits
imposed by the complexity of performing geometric
reasoning later in the vision system restrict the
modeling package to the use of circular and
rectangular faces swept along straight spines.

Dimensions of objects and affixments between
subparts may be described as constrained ranges of
values. Further, loosely constrained descriptions
may be used to specify generic object classes,
while more tightly constrained descriptions may be
specified in parallel to describe subclass
specializations.
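
The idea of generic classes as loose ranges and subclasses as tighter ranges can be illustrated with a small sketch. It does not reproduce Acronym's constraint manipulation system; the class names, dimension names, and numeric ranges below are invented purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed interval [lo, hi] constraining one model dimension."""
    lo: float
    hi: float

    def contains(self, value):
        return self.lo <= value <= self.hi

    def within(self, other):
        """True if this range is at least as tight as `other`."""
        return other.lo <= self.lo and self.hi <= other.hi


def is_specialization(subclass, generic):
    """A subclass must tighten (or keep) every range of the generic class."""
    return all(k in subclass and subclass[k].within(r)
               for k, r in generic.items())


def satisfied_classes(models, measured):
    """Names of all modeled classes whose ranges admit the measured values."""
    return [name for name, dims in models.items()
            if all(k in measured and r.contains(measured[k])
                   for k, r in dims.items())]


# Hypothetical models; the numbers carry no meaning beyond the example.
MODELS = {
    "generic": {"length": Range(20.0, 80.0), "width": Range(3.0, 12.0)},
    "subclass_a": {"length": Range(30.0, 35.0), "width": Range(4.0, 5.0)},
}
```

A measurement that falls inside the tight ranges satisfies both the generic class and the subclass, while one that falls only inside the loose ranges is identified at the generic level alone.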

This modeling capability, along with an
interactive graphics interface, was provided
intact within the Acronym system. Issues of
performance, in terms of the ability to represent
objects with sufficient accuracy, will be
discussed below.

2.5  -  SHAPE  PREDICTION 

In the prediction module, each
three-dimensional volume element in the object
model is used to generate a set of two-dimensional
shapes which represent the visible end faces and
swept surfaces of the volume as seen from the
modeled camera position. The predicted 2-D shapes
are represented by ribbons (rectangular shapes).
Spatial relationships are described by arcs
linking these shapes in specific manners, for
example angle arcs and distance arcs.

The original Acronym system provided the
necessary geometric reasoning knowledge to provide
two dimensional descriptions of shapes which
represent the three dimensional object subpart
models as seen from a modeled vertical camera.
However, two major extensions were required to
provide the necessary shape prediction
capabilities for a more general vision system.

To begin, it was necessary to enhance support
for the prediction of shapes from an oblique
camera view. The prediction process functions by
building symbolic expressions describing the
transformations between object, camera, and world
coordinate systems for the model faces determined
to be visible from the camera view. These
expressions grow considerably more complex in the
case of an oblique camera view, and support for
the simplification of these expressions required
major extensions to the geometric simplification
module originally contained within the Acronym
system.

As now implemented, the prediction module is
capable of predicting both object subpart shapes
and their relative spatial relationships for both
vertical and oblique camera views. This effort
was supported greatly both by consultation with
Dr. Brooks, and by the example of the originally
coded vertical camera case. The resulting system
is limited only in that modeled object positions
and orientations relative to the camera line of
sight must be specified exactly for the oblique
camera case. This limitation is overcome in
practice by running the vision system in two
passes, the first of which applies an overhead
camera approximation to loosely constrained
models. This results in the identification of
candidate matches, which provide specific
positions and orientations for performing more
detailed predictions using detailed models and a
fixed, oblique camera model.

The second major extension was the addition
of a rudimentary capability to predict shadows
cast by object subparts from a known illumination
direction. Most research in this area has been
directed toward the bottom-up interpretation of
shadows to gain insight about the object casting
the shadow. Within the scope of the image
understanding vision system, an assumption has
been made that we have accurate representations of
interesting objects. Also, there is no a priori
knowledge of what observed image shapes represent
cast shadows. In keeping with these
considerations, the system design was extended to
include the prediction of shadow shapes from
modeled object subparts in a manner analogous to
the prediction of object shapes.

The Acronym prediction system was extended to
include shadow prediction for a limited set of
objects. Shadow prediction was designed and
implemented for two cases, a right circular
cylinder with its spine parallel to the shadow
plane and a rectangular parallelepiped with one of
its axes perpendicular to the shadow plane.

Given a world coordinate system, in which a
camera, an illumination source, a shadow plane,
and a solid object are defined, Acronym attempts
to predict the dimensions of the shadow cast on
the shadow plane as they will appear in the image.
This shadow plane is the xy plane in the world
coordinate system. The camera and the object, as
well as each of its subparts (cones), have their
own coordinate systems.

The first part of the shadow algorithm deals
with determining the rotation expressions necessary
to perform transformations between the cone and
world coordinate systems and the cone and camera
coordinate systems. Most of the computations
occurring in the shadow module are performed in
the object (cone) coordinate system. The shadow
plane equation, the illumination direction vector,
and the camera line of sight vector are all
calculated with respect to the object cone. Then
some tests are performed to ensure that neither
the camera line of sight nor the illumination
vector is parallel to the spine of the object cone
(if the object cone is a cylinder), and that the
illumination direction is not parallel to the x,
y, or z axis (if the object is a rectangular
parallelepiped). This is the point where the
algorithm becomes specialized as to particular
object type.

There are two cases presently handled by
Acronym. They are a right circular cylinder with
its spine parallel to the shadow plane, and a
rectangular parallelepiped with one of its axes
perpendicular to the shadow plane.

Using the illumination direction defined with
respect to the coordinate system of a solid
object, it can be determined whether or not a
particular planar surface or face on that object
is illuminated. This is easily done by examining
the dot product of the surface normal and the
illumination direction vector. A negative result
implies that the planar surface is illuminated,
while a positive result implies that the planar
surface is not illuminated. In the case of a
curved surface the location of the illumination
boundary can be calculated. That is, the points
on the surface where the illumination vector is
perpendicular to the surface normal can be
determined. This set of points forms the dividing
line between the illuminated and shadowed
subcontours of the surface. In the case of the
rectangular parallelepiped the shadowed sides are
found, while the cylinder case requires that the
illumination boundary be calculated. These
shadowed faces and subcontours need to be
determined in order to properly predict the shadow
cast by the object.
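
The dot product test and the cylinder illumination boundary can be sketched numerically as follows. Acronym performs this reasoning symbolically, so this is only an illustration. It assumes outward face normals aligned with the object axes, an illumination vector pointing in the direction the light travels, and (as a labeled assumption) a cylinder whose spine lies along the object z axis.

```python
import math

# Outward normals of an axis-aligned rectangular solid in object coordinates.
FACE_NORMALS = {
    "+x": (1, 0, 0), "-x": (-1, 0, 0),
    "+y": (0, 1, 0), "-y": (0, -1, 0),
    "+z": (0, 0, 1), "-z": (0, 0, -1),
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def illuminated_faces(light_dir):
    """Faces whose normal has a negative dot product with the light direction."""
    return {name for name, n in FACE_NORMALS.items() if dot(n, light_dir) < 0}


def cylinder_boundary_angles(light_dir):
    """Angles (radians, about the spine) where the surface normal
    (cos t, sin t, 0) is perpendicular to the illumination vector."""
    lx, ly, _ = light_dir
    if lx == 0 and ly == 0:
        # Corresponds to the excluded case: light parallel to the spine.
        raise ValueError("illumination is parallel to the cylinder spine")
    t = math.atan2(-lx, ly)
    return t, t + math.pi
```

For a light direction of (1, 0, -1), for example, the faces with normals -x and +z are illuminated, and the two boundary lines on the cylinder lie at angles of plus and minus ninety degrees about the spine.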

Initially it was thought that undistorted
shadow contours could be predicted
along with their spatial relations and that the
appropriate shape distortions caused by the camera
angle could be anticipated. This approach would
have paralleled the Acronym algorithm quite
closely. However, it was found that in order for
the subcontours of the cylinder to be correctly
predicted, a new algorithm needed to be adopted
which would include projecting ribbon vertices
onto the image plane and finding the best
rectangular fit approximating the actual ribbon
dimensions. Therefore, instead of defining a
contour corresponding to a surface and predicting
how its dimensions will change given a set of
camera coordinates, the contour nodes and
dimensions in the image plane are directly
calculated. This is done in the case of the right
circular cylinder; however, the case of the
rectangular parallelepiped is handled differently.

The dimensions of the ribbons corresponding
to the rectangular solid are predicted by the
original Acronym method. Its cast shadow,
however, is handled by a separate routine. First
the apparent length (in camera coordinates) of the
shadow cast by a unit vector parallel to the
object spine is calculated. Next, bounds for the
actual shadow length are calculated in two ways:

o  First, by casting this shadow onto the
shadow plane, using the perpendicular
length from the shadow plane to the top
of the rectangular solid and adding this
perpendicular length to the length of the
shadow which is cast; and

o  Second,  by  casting  this  shadow  onto  the 
plane  parallel  to  the  shadow  plane 
through  the  base  of  the  rectangular 
solid,  using  the  solid's  vertical  length. 

The width is calculated by using a normalized,
rotated, and scaled version of the cast shadow
vector.
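
The basic geometric step behind these bounds, casting a point onto the shadow plane along the illumination direction, can be sketched as below. This covers only the world-coordinate step under the stated convention that the shadow plane is the xy plane; the transformation to camera coordinates and the symbolic bounds computed by the actual routine are not reproduced, and the function names are illustrative.

```python
import math


def cast_shadow_point(point, light_dir):
    """Project `point` onto the plane z = 0 along `light_dir`.

    `light_dir` is the direction the light travels and must point downward
    (negative z) for a shadow to fall on the plane.
    """
    px, py, pz = point
    lx, ly, lz = light_dir
    if lz >= 0:
        raise ValueError("light must travel toward the shadow plane")
    t = -pz / lz  # distance parameter at which the ray reaches z = 0
    return (px + t * lx, py + t * ly, 0.0)


def shadow_length(height, light_dir):
    """Length on the shadow plane of the shadow of a vertical edge
    of the given height standing on the plane at the origin."""
    tip = cast_shadow_point((0.0, 0.0, height), light_dir)
    return math.hypot(tip[0], tip[1])
```

Because the shadow length is linear in height, the shadow of a unit vector can be computed once and scaled, which is consistent with the unit-vector step described in the text.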

The  right  circular  cylinder  can  give  rise  to  a 
maximum  of  6  ribbons: 

o  S1 - the ribbon representing the entire
curved surface of the cylinder. It can
contain 2 subribbons, S2 and S3.

o  S2 - the illuminated part of the
cylinder's surface.

o  S3 - the self-shadowed part of the
cylinder's surface.

o  S4 - the shadow cast onto the shadow
plane by the swept contour of the
cylinder.

o  S5 - the entire shadowed area including
S3 and S4.

o  S6 - the entire area covered by the
cylinder and its cast shadow. This
ribbon contains subribbons S1 through S5.


Figure  1.  Predicted  Right  Circular  Cylinder 
Ribbons 

The  rectangular  parallelepiped  can  give  rise  to  a 
maximum  of  4  ribbons: 

o  S1 - the ribbon representing the one or
two visible swept planar surfaces of the
rectangular parallelepiped.

o  S2  and  S3  -  The  two  visible  planar 
surfaces  of  the  rectangular 
parallelepiped. 

o  S4  -  The  shadow  cast  onto  the  shadow 
plane  by  the  swept  contour  of  the 
rectangular  parallelepiped. 
Figure  2.  Predicted  Rectangular  Parallelepiped
Ribbons 

Only the S4 ribbon, the rectangular solid's
cast shadow, is predicted in the shadow prediction
module, while all six of the cylinder's ribbons
are predicted here. New rules were added to some
of the core Acronym rule sets, such as the spatial
relation and interpretation rule sets, to
accommodate the addition of this new set of
ribbons.
2.6  -  SHAPE  EXTRACTION

In order to provide a sufficiently robust
shape extraction capability, a new module for the
identification of rectangular shapes, or ribbons,
was implemented to replace the original Acronym
ribbon finder [2]. This new design takes a
top-down approach to shape extraction, using the
predicted shape dimensions to drive the selection
and operation of rule-based heuristics which build
shape descriptions from extracted line segments.

The new ribbon finding module makes use of
the results of the area of interest operator,
rejecting line segments which lie outside these
interesting regions. The surviving line segments
are then used to identify various line-based image
features, including parallel lines, collinear
lines, and line pairs which form vertices. Since
the identification of these features is
computationally expensive, a library of these
features is built before applying the various
ribbon finding heuristics, to avoid recomputing
similar features as each heuristic rule set
searches for particular shapes. In fact, well
over half the CPU time for a typical ribbon
finding application is spent identifying these
features.

Once a library of these features has been created,
each predicted shape is submitted to a set of
rules which in turn applies appropriate heuristic
rule sets which attempt to identify instances of
that shape from the line segments and line based
features available. For example, a long predicted
shape would trigger the invocation of rule sets
which search for parallel lines or a long segment
corresponding to the dominant long sides of the
shape. These heuristics make use of the predicted
shape dimensions to search, for example, for
parallel sides whose separation falls within the
constrained range of width for the predicted
shape. Similarly, features which indicate the end
closings of the shape are tested for satisfaction
of the predicted range of length.
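
One such heuristic, the search for parallel sides whose separation falls within the predicted width range, might look roughly like the following. This is a sketch rather than the rule-based implementation described here: the segment representation, the angular tolerance, and the function names are assumptions.

```python
import math


def orientation(seg):
    """Undirected orientation of a segment ((x1, y1), (x2, y2)) in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi


def separation(a, b):
    """Perpendicular distance from the midpoint of b to the line through a."""
    (x1, y1), (x2, y2) = a
    mx = (b[0][0] + b[1][0]) / 2
    my = (b[0][1] + b[1][1]) / 2
    dx, dy = x2 - x1, y2 - y1
    return abs(dx * (my - y1) - dy * (mx - x1)) / math.hypot(dx, dy)


def parallel_pairs(segments, width_range, angle_tol=math.radians(5)):
    """Index pairs of near-parallel segments separated by a predicted width.

    width_range is the (min, max) width of the predicted ribbon;
    angle_tol is an assumed tolerance, not a value from the paper.
    """
    lo, hi = width_range
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            d = abs(orientation(segments[i]) - orientation(segments[j]))
            d = min(d, math.pi - d)  # orientations wrap around at pi
            if d <= angle_tol and lo <= separation(segments[i], segments[j]) <= hi:
                pairs.append((i, j))
    return pairs
```

The pairwise loop is quadratic in the number of segments, which gives some intuition for why such features are computed once and kept in a library rather than recomputed by each rule set.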

Separate heuristic rule sets are invoked by
the selection rules depending on the type of
shapes predicted. Long shapes are identified, as
exemplified, by prominent side features, while
more square shapes are identified by either
prominent sides or ends, or by corners. Small
shapes are more typically described by a set of
segments which completely close the shape, and
rules are implemented for this case as well.
Finally, shadow shapes are often supported by only
line features, and heuristics searching for any
one of the four sides are included. As each rule
set builds appropriate ribbon shapes, they are
appended to a graph structure to provide access by
the matching and scene interpretation module.

In practice on application imagery, the
success of this model directed shape extraction
system is dependent on two related modules.
First, the low level processing and line finding
must be capable of providing sufficient quality of
features to support the heuristic extraction of
shapes. Secondly, there is the assumption that
the chain of approximations which models object
subparts as generalized cylinders, and predicts
them as ribbon shapes, adequately represents the
actual shapes to be found in the image data. If
this is not the case, the model based predicted
shapes actually misdirect the ribbon finder. A
serious instance of the problem occurs in the
prediction and extraction of shadow shapes. To
begin, shadows are approximated as rectangular
ribbons, when the actual calculated ribbon would
be more suitably represented as a parallelogram.
Since shadow boundaries are typically weak, and
the ribbon finding heuristics thus rely on very
few features, it is not uncommon for the resulting
extracted ribbon to be misoriented by the angular
error resulting from the rectangular
approximation. In addition, the predicted width
of the rectangular shape is taken along the radius
perpendicular to the spine of the ribbon, while
the parallelogram shape actually seen in the image
data will have end segments which are not
perpendicular to the sides or spine, and are
longer than the predicted width.

Apart from these problems the ribbon finder
has performed quite well on shapes of varying
resolution and with limited edge features
describing the important shapes. Performance is
greatly enhanced, especially in terms of
eliminating clutter shapes, if the predicted
shapes are tightly constrained, implying tight
model constraints and an accurate camera model.

2.7  -  INTERPRETATION 

The interpretation module performs two broad
functions within the vision system. First, it
performs a matching function, locating sets of
observed shapes which are consistent with a set of
predicted shapes, both in terms of shape
dimensions and relative spatial relationships.
Secondly, these match sets are interpreted by
determining what modeled generic class or subclass
specializations are satisfied by each match set.

The first function, the creation of match
sets, builds sets of nodes which contain the
predicted and extracted shapes which match
dimensionally. Graphs are built which link these
nodes using arcs which describe the spatial
relationships between them, driven by the
predicted spatial arcs. Further, as these graphs
are built, corresponding restriction nodes are
constructed which describe the shape and arc
matches in terms of the model and camera
parameters, forming back constraints which
restrict the interpretation of each match. The
second interpretation function determines the
satisfiability of these back constraints, thereby
determining membership of the matched object in
specific modeled classes. These interpreted
graphs form the final result of the vision system
as applied to a single scene.

As in the prediction module, the
interpretation system required significant
extension, especially to support the matching of
shadow shapes. These shadow cases required a
significant recoding to support new spatial arcs,
as well as to improve the performance of the
matching and interpretation tasks.

2.8  -  SITUATION  ASSESSMENT 

In many applications, models of temporal
activities of various interesting objects can
provide the overall system with capabilities
beyond those achievable with a single-frame vision
system alone. These applications involve
situations where the objects have predictable
behavior patterns. More specifically, these
applications involve objects whose observed
locations over a sequence of images may imply
their behavior, and the sequences of observed
activities form a predictable history.

Tracking observed objects may serve to
confirm or disconfirm the results of the vision
system object classification. If the vision
system misclassifies an object, then the situation
assessment module has the opportunity to detect
the error by either noticing the change in an
object identification of a previously known and
tracked object, or by detecting an object not
behaving in a normal manner. These situations may
be flagged or corrected, depending on the
confidence or ambiguity in the inferred
identification.

The method implemented to monitor object
behavior is based on knowledge structures, called
scripts, which describe expected behaviors and
lengths of activities. These
scripts are matched against observed object
identification sequences to infer activities.
Inclusion of knowledge of activity durations
allows further inference even when some stage of
the script may not have been observed.

As the script activities are tracked, various
ambiguities may make it necessary to carry along
multiple script interpretations, which are weeded
out as more observations are made, until either a
consistent interpretation is found or all the
interpretations are eliminated, at which time an
anomalous situation would be flagged. This script
processing system has been developed and exercised
using manual observation inputs.
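
The bookkeeping for multiple script interpretations can be sketched as follows. This is a deliberately simplified illustration: scripts are reduced to ordered lists of state labels, activity durations are ignored (so skipped stages are not inferred), and the script and state names in the usage note are invented.

```python
def track_scripts(scripts, observations):
    """Carry along every script interpretation consistent with the observations.

    scripts: dict mapping a script name to its ordered list of states.
    observations: sequence of observed state labels.
    Returns (status, names), where status is 'consistent', 'ambiguous',
    or 'anomaly', and names are the scripts still under consideration.
    """
    # A hypothesis is (script name, index of current state); None = not started.
    hypotheses = {(name, None) for name in scripts}
    for obs in observations:
        survivors = set()
        for name, pos in hypotheses:
            steps = scripts[name]
            # First observation may enter a script anywhere; afterwards the
            # object either stays in its state or advances to the next one.
            candidates = range(len(steps)) if pos is None else (pos, pos + 1)
            for j in candidates:
                if j < len(steps) and steps[j] == obs:
                    survivors.add((name, j))
        hypotheses = survivors
        if not hypotheses:
            return "anomaly", set()  # every interpretation was eliminated
    names = {name for name, _ in hypotheses}
    return ("consistent" if len(names) == 1 else "ambiguous"), names
```

With two hypothetical scripts, `{"load": ["arrive", "dock", "depart"], "transit": ["arrive", "depart"]}`, a first observation of "arrive" leaves both interpretations alive, a following "dock" leaves only "load", and an out-of-order sequence eliminates both and is flagged as anomalous.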

3  -  FUTURE  EXTENSIONS 

The most useful result of the first
implementation of an image understanding vision
system is the identification of key problems in
the photo interpretation task. These results have
guided plans for future extension of the system
toward a next-generation vision system.

Beginning at the bottom level of image
analysis, it has become clear that more robust
means of extracting intensity edges, shadow
boundaries, and region borders are needed. This
will be a substantial area of effort in the second
contract phase. Region growing techniques,
combined with more generalized edge detection,
will be explored.

While the modeling and prediction capability
of the current implementation will form an
important piece of the new vision system, extended
capabilities will need to be added to perform
prediction of complex shapes generated in
situations where object occlusions and detailed
shadows cast on nearby 3-D objects must all be
considered to identify individual objects. This
will include very detailed models, with a more
general capability to model irregular shapes, and
a projection scheme to predict exact shapes as
would be seen in image data. These detailed
predictions require exact modeling conditions.
The current symbolic prediction system will
provide an important coarse pass capability, to
establish specific model positions and
orientations.

Several issues regarding the vision system
performance will be addressed by an intelligent
planner system. This system will select models,
control coarse pass matching as well as detailed
passes, select appropriate low level processing to
optimize feature extraction, and select
appropriate shape extraction heuristics. Shape
extraction will be further enhanced by the
direction of shape extraction in an ordered
manner, searching first for dominant subpart
shapes, and then directing local searches for more
detailed subparts and shadow shapes based upon
modeled spatial relationships.

At an even higher level of control, an expert
system will attempt to emulate some of the
procedural expertise applied to the image
understanding task by a human photo interpretation
specialist. This knowledge will direct
application of the vision system to perform
specific analysis tasks, such as scene
mensuration. It will also determine what results
are important, when satisfactory results have been
obtained from the vision system, and will be
capable of the sort of temporal analysis now
captured in the situation analysis module.

4  -  CONCLUSIONS 

With Phase One efforts on the Image Understanding
program concluded, significant progress
has been made toward technological capabilities to
automate portions of the photo interpretation
task. In its present form, the system is capable
of identifying generic classes and subclasses of
objects by matching basic dimensions and spatial
relationships. Performance is rudimentary in
comparison to human capabilities; several areas
have been identified for improvement. Individual
progress in the areas of shape and shadow
prediction, shape extraction, and scene
interpretation has contributed. However, the
most significant mark is the inclusion of a
complete set of modules for performing object
identification in real imagery in a single
integrated system, providing a basis for future
extension as well as evaluation of other
techniques.

ABSTRACT

Many groups are now attempting to build autonomous land
robots. However, no vehicle fitting this description exists today.
The difficult problems that must be solved are obstacle avoidance
and scene matching to map coordinates.

Autonomous robot vehicles would be used primarily by the
military for reconnaissance. Enemy targets would either be
designated by laser spotting or their locations and identities
transmitted to friendly forces.

Westinghouse, as a subcontractor to the Computer Vision Lab
of the University of Maryland, with the support of the Defense
Advanced Research Projects Agency and the U.S. Army Night
Vision and Electro-Optics Laboratory, is developing a test-bed
facility for investigating autonomous vehicle navigation. A
Westinghouse vision system will be added to an existing electronically
controllable Army vehicle.


1. INTRODUCTION

Since April 1976, the Computer Vision Laboratory of the
University of Maryland, in collaboration with Westinghouse as a
subcontractor, has been engaged in Image Understanding
research with the support of the Defense Advanced Research
Projects Agency and the U.S. Army Night Vision and Electro-Optics
Laboratory. This research has resulted in a number of significant
advances in the field of image understanding, with particular
emphasis on techniques applicable to tactical imagery. A new
research project has started which will build on this work to
develop an autonomous vehicle navigation system.

Section 2 discusses the need for autonomous vehicle navigation.
The main military application is the detection, recognition,
and reporting of rear-echelon enemy targets. This mission can be
accomplished by land or air. The unmanned ground vehicle
offers some advantages over its airborne counterpart, while
allowing penetration into territory generally considered too risky for a
man. Section 3 surveys the state of the art in vehicle development.
Land, air, and undersea vehicles are all covered since many
of the computer, command-and-control, and sensor problems are
common to them. Section 4 describes a test-bed facility proposed
for implementation at Westinghouse. The test vehicle will have a
real-time Westinghouse vision system on board. One goal of this
project is to test software developed by other research groups, as
well as that originating at Westinghouse and the University of
Maryland.


2. THE NEED FOR AUTONOMOUS VEHICLE
NAVIGATION 


The Army's emerging doctrine of "deep attack" on second and
later echelons of the enemy offensive requires that artillery and
aircraft strikes be directed against these more distant targets.
Successful use of this fire power is dependent upon obtaining
accurate target positions and the ability to provide precise
weapon delivery. Current developments in fire-and-forget
weapons offer the prospect of major advances in weapon delivery
precision, once the targets are located. However, initial detection
and location of rear-echelon targets remain difficult
reconnaissance tasks. This project is directed toward the solution of
this problem.


The desired reconnaissance can be accomplished by land or air.
The land mission is presently accomplished by "spotters" who
are positioned behind enemy lines to locate and report by radio
on enemy target concentrations (figure 2-1).

Forward air controllers assist in air strikes which employ
laser-guided missiles by designating key targets with laser beams. The
risks involved in these assignments have spawned various
attempts to automate the spotter process. Sensor packages for
vehicle detection were developed at the time of the Vietnam
conflict for air drop in enemy territory. One of the major problems
faced by this Air Force program was the need to establish the
exact locations of the sensor packages themselves. Another was
the fact that these immobile packages often came to rest in
positions from which no detections could be made, or from which
their detection by the enemy was easy. In an Army program
carried out about the same time, a mobile system was developed for
protection of railroad equipment by placing sensors on a small,
unmanned railroad car moving at a safe distance ahead of the
locomotive. The function of the sensors was to detect any
anomalies in the track condition, including placement of mines near
the tracks.


These attempts to automate the land reconnaissance activity
indicate the value of close-up target information. However,
collection by air is an alternate approach. Standoff target
acquisition systems, which use long range sensors located above
friendly territory, lack the resolution (if not the range) to
provide detailed data on rear-echelon targets. Low-level
reconnaissance aircraft which penetrate enemy territory are
severely restricted by the time available for target search
and by the masking of trees or buildings. Helicopters offer
increased search time at the risk of reduced survivability. Remotely
Piloted Vehicles (RPVs) will relieve some of the constraints on
search time and survivability, but are not expected to produce
information which is competitive with ground reconnaissance.
However, the navigation techniques to be developed for this
project will apply to aircraft as well as land vehicles.


Step 1

• Forward Spotter Checks Map To Determine His Own Position

Step 2

• Spotter Locates Enemy Targets

• Spotter Estimates Target Locations in Map Coordinates

Step 3

• Spotter Transmits Target Locations to Friendly Forces By Field Radio

• Spotter May Designate Targets With a Laser Designator


Figure 2-1. Forward Spotter Scenario


The above discussion suggests the need for an autonomous
land vehicle which can be delivered deep inside enemy territory
and which can

• establish its position in map coordinates,

• navigate in a limited way through its environment,

• detect and identify targets and transmit their locations
and identities to friendly forces, or designate them by
laser spotting.

This  concept  is  illustrated  in  figure  2-2. 


The functions of an autonomous land vehicle, as stated above,
represent several areas of advanced research, including scene
matching to map coordinates, automatic target recognition,
obstacle avoidance, and robotics. All of these areas have been
under recent investigation; of the four, however, it is felt that scene
matching to map coordinates will require by far the greatest
advances in order to establish the feasibility of the land vehicle
concept. The robotics area is currently receiving a high level of
research support, both military and industrial, which is expected to
solve many of the control problems.

Automatic target recognition devices which discriminate
between target classes are now in the advanced development stages
in the Army and Air Force. Scene matching between similar
views of the same area is also a well-established capability.
However, the challenge for the land vehicle is to determine its position
in map coordinates based upon a series of narrow field
observations taken from a limited series of vantage points.

Another challenge is obstacle avoidance. Large military
vehicles have little difficulty traversing ruts and small obstacles, but
must avoid barriers, "traps," and cliffs. Most current research
on obstacle avoidance centers around the use of laser range data.

2.1 Autonomous Ground Vehicles

As stated above, the ground vehicle should have the ability to
perform limited maneuvers so as to improve its vantage point,
and avoid immediate detection. It must be capable of detecting
and avoiding large obstacles, based upon the interpretation of
imagery obtained from its sensors. Infrared, TV, laser range, and
radar sensors, as well as combinations of them, will be
considered for use in this project.

The land vehicle possesses several advantages over the
airborne platform. It offers long periods of observation at ranges at
which detailed target identification can be carried out. Its
position can be quite accurately defined. Furthermore, it presents a
stable platform for designation.


Figure 2-2. Ground Robot Vehicle Scenario


2.2  Autonomous  Airborne  Vehicles 

The Remotely Piloted Vehicle (RPV) offers another solution to
the reconnaissance problem (figure 2-3). However, an RPV
scenario (figure 2-4) introduces complications of its own. In its
current state of development a video link is needed between the RPV
and the ground. Imagery is transmitted over this data link, and
the image processing tasks of target detection and recognition are
performed at a ground control station. The limitations of this
approach are its bandwidth requirements and the possibility of
the enemy jamming or intercepting the transmission. These
limitations can be overcome by reducing the data to be transmitted,
encoding it, and having the RPV operate autonomously, using
onboard sensor data.

One solution (figure 2-5) is to have an onboard vision system
which will

• navigate the air vehicle through the environment,

• locate, identify, and track enemy targets, and

• transmit target locations and mark them with coded laser
spots to direct Hellfire antitank missiles and precision
guided ordnance.


Figure  2-3.  Launch  of  Aquila  Remotely  Piloted  Vehicle 


2.3  Application  to  Current  Scenarios 

Even though the proposed vision system is intended for use in
an unmanned vehicle, it may also find application in a manned
vehicle, such as a helicopter. Its function would be to partially or
wholly assume certain visual tasks of the pilot, so as to reduce the
burden on him during battlefield conditions. In particular, the
ability to "hand off" acquired targets to a ground pilot or to
another helicopter in map coordinates is greatly desired as an
automatic function, but is presently not available.

3. A SURVEY OF AVAILABLE AUTONOMOUS AND
REMOTELY CONTROLLED VEHICLES

Despite the impression given to the general public by science
fiction movies and the popular press, autonomous outdoor land
robots are not now in existence. However, some experimental
test-bed facilities are now being built. The state of the art is
considerably more advanced for remote-controlled vehicles; air,
undersea, and land vehicles are now in production. We will review
work in these areas now taking place in the United States. The
considerable research and development effort now underway
outside the U.S. will also be mentioned.

3.1  Remote-Controlled  Vehicles 

A large number of remote-controlled vehicles have been built
for use in the air, under water, and on land. Much of the
technology developed for remotely controlled vehicles can be applied
directly to autonomous vehicles.


Figure 2-4. Current RPV Scenario

Air  Vehicles 

The U.S. Army awarded a contract to Lockheed in August
1979 for the full-scale development of the Aquila Remotely
Piloted Vehicle (RPV). Delivered hardware will consist of 22 air
vehicles, four ground control stations, three launcher
subsystems, three recovery subsystems, and 18 mission payload
subsystems. Westinghouse is designing and building the mission
payload subsystem, which consists of a stabilized daylight TV sensor,
laser rangefinder/designator, stabilized optical path,
autotracker, and associated electronics and controls. Once a target is
located by slewing the TV camera through controls in the ground
station, the camera can be switched to automatic track. The
boresight laser can be activated either to provide range to targets
or to designate targets for precision guided munitions.

The Aquila operator’s console includes a teletype for inserting
the planned flight profiles; airspeed, altitude, and heading
controls; displays for manual air vehicle control; an X-Y plotter for
monitoring aircraft position on a map; and an alphanumeric
terminal to display vehicle status. The mission payload console
provides a similar interface between the operator and mission
payload subsystem. The video sensor can be controlled in azimuth
sweep, elevation, three-position zoom, autotrack, and laser aim/fire
functions. Both consoles provide real-time video display with
instant replay capability.

• Robot Vehicle Transmits Target Locations in Map Coordinates to Fire Control Station

• Robot Vehicle Designates Target With a Laser


Figure 2-5. Proposed Scenario for Airborne Robot Vehicle

Two RPVs are produced in Israel: the Israel Aircraft Industries
Scout and the Tadiran Mastiff.19 They were used for
reconnaissance in the destruction of Syrian surface-to-air missile batteries
in Lebanon’s Bekaa Valley in June 1982. They located the precise
positions of SAM sites and relayed the video data to a ground
station in real time. The Israeli RPVs are less costly and contain
a less sophisticated payload than their American counterpart.

Land Vehicles

A number of remotely controlled vehicles have been built for
duty too hazardous for a human. Vehicles have been built for
travel through contaminated environments, such as the Three
Mile Island nuclear power plant. In fact, the Atomic Energy
Commissions of most industrialized nations have
remote-controlled-vehicle programs. The two main U.S. companies
building remotely operated vehicles are GCA/PaR Systems and
Tracor MBA.

GCA/PaR has built about 15 remotely operated vehicles. We
will describe their PaR-1 Manipulator Vehicle (PMV) shown in
figure 3-1. The PMV provides a mobile, maneuverable platform
for robot arms and TV cameras. The vehicle rides on two wide,
flat, neoprene belted tracks. The two rear wheels are powered by
separate DC motors. The vehicle is turned by running one of the
motors faster than the other, or by driving them in opposite
directions. The vehicle contains a double-telescoping, rotating
tube assembly upon which manipulator arms and TV cameras
can be mounted. Two highly dexterous manipulator arms are
available, with load capacities of 100 and 160 pounds,
respectively. TV camera arms are mounted on the same assembly as the
manipulator arm so that camera motions are synchronized in
vertical movement and rotation with the manipulator shoulder
housing. The manipulator can operate a number of tools
including saws, shears, fasteners, and torches. It can switch between
them via remote control.
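
The steering scheme described above, in which the vehicle turns by driving its two motors at different speeds or in opposite directions, is the standard differential-drive arrangement. The sketch below shows the usual kinematic relation between track speeds and vehicle motion. The function name, units, and track separation are illustrative assumptions and are not PMV specifications.

```python
from typing import Tuple


def body_velocity(
    v_left: float, v_right: float, track_separation: float
) -> Tuple[float, float]:
    """Return (forward speed, turn rate) for a differential-drive vehicle.

    v_left, v_right: ground speeds of the left and right tracks (m/s).
    track_separation: distance between the track centerlines (m).
    Turn rate is in rad/s, positive for a left (counterclockwise) turn.
    Track slip is ignored.
    """
    if track_separation <= 0:
        raise ValueError("track_separation must be positive")
    forward = (v_left + v_right) / 2.0
    turn_rate = (v_right - v_left) / track_separation
    return forward, turn_rate
```

Equal track speeds give straight-line travel, unequal speeds give a curved path, and equal and opposite speeds turn the vehicle in place with no forward motion.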


Figure  3-1.  PaR  Manipulator  Vehicle 


The PMV control console is portable. It connects to the vehicle
by a flexible, quick-disconnect cable. The console provides the
necessary control and power for the vehicle and robot arm. It
can also include video monitors and camera controls. The
following functions of the TV camera can be controlled from the
console: lens zoom, iris setting, pan and tilt, and movement of
the camera’s positioning arms.

Tracor MBA has built about 20 remote-operated vehicles.
Their "centipede" has six wheels, TV cameras, and a
front-mounted manipulator. It is capable of traveling over rough
terrain. It can traverse 33-degree slopes, 3-foot vertical barriers,
and even climb up and down stairs. The manipulator arms and TV
cameras move as slaves to human controlled masters in the
vehicle’s remote control console. A human controls the arms by an
exoskeleton over his own arms. The manipulator arms have
force feedback to let the human operator sense the weight of
objects remotely handled. The human operator controls the
gripping force; it is stated that he can get the remote-controlled arms
and hands to perform very delicate tasks such as threading a
needle.

The vehicle’s two TV cameras can be controlled by a unit
mounted on the operator’s head. The operator’s biocular display
provides him with a stereo view. The movement of the slave TV
cameras is controlled by the operator’s head movement. Control
signals and video are transmitted either by cable or by an RF
telemetry link. The cable can be up to 300 feet long; the control
trailer can be up to 15 miles away if a telemetry link is used.

Other MBA remote-controlled land units include a tracked
vehicle, built for the Air Force; an antiarmor vehicle and fork lift,
built for the Army; an underground mining vehicle with
manipulator, built for the Bureau of Mines; and a driverless tractor
remote handling system, built for Lawrence Livermore Laboratory.

Undersea

Hundreds of remotely controlled underwater vehicles have
been built. The reader is referred to the proceedings of the March
1983 conference on remotely operated undersea vehicles.1

An example of the state of the art is the Surface Towed Search
System developed for the Navy by Westinghouse’s Oceanic
Division (Annapolis, MD). It is designed to locate and identify
objects on the ocean floor at depths of 4 miles. Currently,
computers on board the surface vessel handle signal processing, data
recording and analysis, control of the sidelooking sonar, and
towing operations. Work is underway at Westinghouse to give
more autonomy to submersibles. The emphasis is on the use of
onboard expert systems and sonar image processing. Phil
Schweizer of Westinghouse, together with Charles Thorpe and Dave
McKeown of Carnegie-Mellon University, has developed an
algorithm for autonomous vehicle navigation. It matches sets of
objects on sonar images with known landmarks.
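
The text above does not describe how the algorithm works. Purely as a generic illustration of matching detected objects against known landmarks, the sketch below searches for the translation that brings the largest number of detections onto mapped landmark positions. It assumes the vehicle heading is already known, so that only a two-dimensional shift is unknown. The function name, the tolerance parameter, and the brute-force voting scheme are assumptions made here for illustration and should not be read as the method developed by the authors named above.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def best_translation(
    detections: List[Point], landmarks: List[Point], tolerance: float = 1.0
) -> Tuple[Optional[Point], int]:
    """Find the shift that maps the most detections onto landmarks.

    Every (detection, landmark) pairing proposes a candidate shift; the
    shift supported by the largest number of detections wins. Returns
    (shift, support); shift is None if either list is empty.
    """
    best_shift: Optional[Point] = None
    best_support = 0
    for dx0, dy0 in detections:
        for lx0, ly0 in landmarks:
            shift = (lx0 - dx0, ly0 - dy0)
            support = 0
            for dx, dy in detections:
                moved = (dx + shift[0], dy + shift[1])
                # A detection supports the shift if it lands near any landmark.
                if any(math.dist(moved, lm) <= tolerance for lm in landmarks):
                    support += 1
            if support > best_support:
                best_shift, best_support = shift, support
    return best_shift, best_support
```

If the detections are expressed relative to the vehicle, the winning shift is an estimate of the vehicle position in map coordinates. Detections that match no landmark, such as clutter in the sonar image, lower the support count without affecting the estimated shift.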

3.2  Robot  Vehicles 

Robot  vehicles  will  be  described  under  five  headings:  (1) 
legged,  (2)  indoor  and  flat-surface,  (3)  military  land  units,  (4) 
underwater,  and  (5)  airborne.  Although  none  of  the  ground  units 
described  is  fully  autonomous,  this  is  the  long  term  goal  of  their 
builders. 

Legged Vehicles

Robert McGhee of Ohio State University considers legged
vehicles particularly appropriate for travel over rough or mucky
terrain. He has built a small six-legged walking machine called a
hexapod and is now building a much larger unit. In the near
future, McGhee believes that some type of human control will be
required to guide the vehicle over terrain.2 One approach is for a
human to point to terrain spots with a laser and for the vehicle’s
sensor system to detect and direct the vehicle to follow the spots.
Although in experiments, the hexapod has already been
demonstrated in an autonomous mode with onboard computers used
for supervisory control, McGhee doesn’t see real autonomy
coming for several years. He anticipates that autonomous navigation
will be based upon active range data used to form a local
elevation map, and will not utilize visual scene analysis, at least not
in the near future.
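
The idea of forming a local elevation map from active range data can be sketched as follows. This is a minimal illustration under stated assumptions and does not describe McGhee's system: each range reading is assumed to arrive as an azimuth angle, a depression angle below horizontal, and a measured range from a sensor at a known height above the ground plane. The function name, the reading format, and the grid cell size are hypothetical.

```python
import math
from typing import Dict, Iterable, Tuple

# (azimuth in degrees, depression below horizontal in degrees, range in m)
Reading = Tuple[float, float, float]


def elevation_map(
    readings: Iterable[Reading], sensor_height: float, cell_size: float = 0.5
) -> Dict[Tuple[int, int], float]:
    """Bin range readings into a grid of terrain heights.

    Each reading is converted to an (x, y, z) point relative to the
    ground directly beneath the sensor. Each grid cell keeps the
    highest z value seen in that cell.
    """
    grid: Dict[Tuple[int, int], float] = {}
    for az_deg, dep_deg, rng in readings:
        az, dep = math.radians(az_deg), math.radians(dep_deg)
        horizontal = rng * math.cos(dep)       # distance along the ground
        x = horizontal * math.cos(az)
        y = horizontal * math.sin(az)
        z = sensor_height - rng * math.sin(dep)  # height of the hit point
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        grid[cell] = max(z, grid.get(cell, -math.inf))
    return grid
```

Keeping the highest point in each cell is a conservative choice for obstacle avoidance, since it records the tallest thing the sensor saw there. Cells with no readings are simply absent from the map.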

McGhee’s  larger  vehicle  will  have  12  Intel  8086/87  single  board 
computers  for  motion  planning  and  vehicle  control.  Specialized 
hydraulic  circuits  will  be  used  for  drive,  lift,  and  lateral  motion. 
Simple  nonscanning  proximity  detectors  may  be  used  for  local 
control of leg motions. Both optical and acoustic sensors are
being studied for this purpose.

Odetics has built a large, spindly, radio-controlled, six-legged
walking and lifting machine2 (figure 3-2). It has a clear bubble
"head" with built-in TV camera. It is designed to walk over
rugged terrain and handle dangerous materials.

Research into walking machines is also taking place at several
other locations. Researchers at Moscow University have built at
least two hexapods. Marc Raibert of Carnegie-Mellon University
has built a one-legged hopping machine designed for studying
balance control. Ivan Sutherland of CMU has built a six-legged
human-controlled vehicle.7 Shigeo Hirose of the Tokyo Institute
of Technology has built a four-legged spiderlike machine that
can climb stairs.

Indoor  and  Level-Ground  Vehicles 

Hans Moravec of Carnegie-Mellon University has been
building robot vehicles with vision systems for a number of years2
(figure 3-3a). The vehicle now being built (figure 3-3b) will
contain 6 to 12 Motorola 68000 single board computers and a TV
camera mounted on a pan-and-tilt mechanism. The vehicle runs
on three small wheels allowing high mobility over a flat surface.
Moravec’s extensive experience in this area leads him to believe
that the initial step of building and controlling the vehicle is not
too difficult to achieve, but it is very hard to get it to do anything
significant.

Figure 3-2. Odetics’ Odex-1

Stanford University has acquired an experimental robot cart
from Westinghouse’s Unimation Division. The Unimation rover
is similar in design and function to the CMU rover. It achieves
full three-degree-of-freedom floor-plane mobility on a flat
surface.

The World of Robots Corporation is developing a
four-wheeled Emergency Security Robot (figure 3-4). Some of its
functions have been demonstrated in a remote control mode. The
device is intended for automatic patrolling, sensing, and acting on
intrusion. The unit has five types of sensors: infrared, audio,
microwave, ammonia detector, and TV camera. The vehicle has a
number of modes of communication: synthesized speech, lights
and sirens, and data link. Onboard computers will be used for
speech, path following, defense, and reporting.

A number of other robots designed for travel over flat surfaces
are listed in table A.

Military  Land  Vehicles 

Military robots can be grouped into two broad classes: static
and mobile. A static robot would perform the tasks of
point-defense, surveillance, and communication resource
management. A mobile robot could be used for mine detection;
surveillance; and target detection, recognition, and designation. An
intermediate class would consist of vehicles that seek high
ground with a clear view, and stop once they find it.

Several groups are trying to robotize existing military vehicles.
The interest is in heavy tracked vehicles like tanks and APCs,
which can go over small obstacles. Work in this area is being
done by Scott Harmon of the Naval Ocean Systems Center,
Rosa Chang of FMC Corp., Alexander Meystel of the University
of Florida, and several other groups. There is much similarity in
the approaches being taken by the separate groups. All vehicles
are to be made fully autonomous with onboard expert systems
and specialized hardware for the processing of visual and/or
range images. All groups are using Intel single board computers
for general purpose processing. Of the three programs, the one
by Scott Harmon is the most ambitious. He expects to have a
fully working autonomously navigating vehicle by 1987.

Figure 3-3a. Stanford Cart

Hughes Research Labs is building a test-bed facility for
developing robot vehicle software. They are taking a straight AI
approach. They are planning to transition to a real vehicle in
1985-1986, possibly in conjunction with FMC.

The  Jet  Propulsion  Lab  vehicle  team  has  suspended  work  on 
the  Moon  and  Mars  rovers  (figure  3-5)  and  has  directed  their 
attention  to  robotizing  military  vehicles. 

Underwater Vehicles

For a tethered submersible, the drag imposed by the tetherline
is a major limiting factor. As the cable length is increased, the
cable must be made thicker; the power required to pull it
increases accordingly. With four- to seven-mile-long cables, the
shipboard cable handling operations become massive and costly.
Unmanned, tetherless submersibles are, therefore, particularly
desirable for work at great depths. About a dozen vehicles of this
type are described in the open literature. The basic characteristics
of ten of them are listed in table C.

An autonomous underwater vehicle must be able to determine
its own location, locate objects, and perform tasks. It must have
command and control software to make decisions and direct
operations, since a data link is not possible at these depths. Due to
the murkiness of the water, vision is ordinarily possible only at
extremely close-in ranges. Increased range can be obtained by
imaging sonar systems; however, image resolution and quality
are limited.


An example of the state of the art in autonomous submersibles
is the University of New Hampshire’s EAVE-East vehicle.
EAVE-East derives all its information from onboard sensors and
is commanded by an onboard expert system. A Motorola SBC is
used for command and control. This master computer also
directs data acquisition and recording, and controls still, movie, and
slow-scan TV cameras and a multibeam sonar system. Three
dedicated processors are also on board to perform specific tasks.
One controls the thruster speed and vehicle depth, based upon
data from pressure sensors. Another is for navigation. It
analyzes range and bearing data from remote acoustic transponders.
The third dedicated processor handles communication.
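
A position fix from range and bearing measurements to transponders at known locations can be sketched as below. This is an illustrative calculation only and is not taken from the vehicle's software. It assumes a flat two-dimensional geometry in east/north coordinates and assumes that each bearing is the compass direction from the vehicle to the transponder, in degrees clockwise from north. The function name and the simple averaging of individual fixes are also assumptions.

```python
import math
from typing import Iterable, Tuple

# (transponder east, transponder north, range in m, bearing in degrees)
Fix = Tuple[float, float, float, float]


def estimate_position(fixes: Iterable[Fix]) -> Tuple[float, float]:
    """Average the vehicle positions implied by each transponder fix.

    Each fix places the vehicle at the transponder position minus the
    range vector pointing from the vehicle toward the transponder.
    """
    east_sum = north_sum = 0.0
    count = 0
    for t_east, t_north, rng, bearing_deg in fixes:
        b = math.radians(bearing_deg)
        east_sum += t_east - rng * math.sin(b)
        north_sum += t_north - rng * math.cos(b)
        count += 1
    if count == 0:
        raise ValueError("at least one fix is required")
    return east_sum / count, north_sum / count
```

Averaging several fixes reduces the effect of noise in any single measurement. A weighted average or a least-squares fit would be natural refinements when the measurement errors are known.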

Air Vehicles

Existing robot air vehicles are of the "smart missile" type.
They are preprogrammed with navigation and targeting
information. Various types of active and/or passive sensor data are used
during flight. They differ fundamentally from the vehicles
described previously since they have a short mission and explode at
the end of it.

A cruise missile is basically any unmanned jet aircraft that
adjusts its course while traveling to the target. This is
accomplished by the use of its inertial guidance system and by the
matching of sensed scenes to reference maps. One approach is to
compare the sensed terrain contour with digital elevation maps
of the flight path.
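
As a rough illustration of the terrain contour comparison described above, the following Python sketch slides a sensed elevation profile along a one-dimensional reference profile and picks the offset with the smallest mean squared difference. The function name `best_profile_offset`, the sample elevations, and the mean-removal step are assumptions made for this example; it is a simplified sketch and not a description of any fielded guidance system.

```python
def best_profile_offset(reference, sensed):
    """Return (offset, cost) of the best alignment of `sensed` within `reference`.

    Both arguments are lists of terrain elevations sampled at equal spacing.
    The cost is the mean squared difference after removing each profile's mean,
    so a constant altitude bias in the sensed data does not affect the match.
    """
    if not sensed or len(sensed) > len(reference):
        raise ValueError("sensed profile must be non-empty and no longer than reference")

    def remove_mean(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    sensed_centered = remove_mean(sensed)
    best_offset, best_cost = None, float("inf")
    for offset in range(len(reference) - len(sensed) + 1):
        window = remove_mean(reference[offset:offset + len(sensed)])
        cost = sum((a - b) ** 2 for a, b in zip(window, sensed_centered)) / len(sensed)
        if cost < best_cost:
            best_offset, best_cost = offset, cost
    return best_offset, best_cost


# Hypothetical reference elevations along the planned flight path.
reference_map = [120, 125, 140, 180, 210, 190, 150, 130, 128, 135, 160, 170]
# Hypothetical sensed profile: a stretch of the map with a constant 7-unit bias.
sensed_profile = [h + 7 for h in reference_map[3:8]]

offset, cost = best_profile_offset(reference_map, sensed_profile)
```

The offset found indicates where along the reference profile the vehicle appears to be; a practical system would work in two dimensions and would have to cope with noise and scale errors.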


Figure 3-3b. CMU Rover
Figure 3-4. World of Robots Emergency Security Robot
(Not Fully Operational)

An autonomous tactical missile performs target detection and
recognition. Terminal homing is then achieved by the processing
of radar, IR, or visible signals.

One example of the state of the art is the French Exocet air-to-sea
missile.*’0 The aircraft carrying the Exocet flies low over the
water beneath the enemy's radar coverage. The pilot then pops
the aircraft up to get a bearing on the target, and then drops
down again below the enemy's radar coverage. Once the coordinates
of the target are determined and passed to the missile, the
missile is released in a fire-and-forget mode. The missile heads
toward the target primarily under the control of its own inertial
guidance system. It uses its radio altimeter to travel just above
the water surface. As the missile nears the target it rises slightly,
scans the horizon, and locks its terminal homing radar onto the
target.

Another type of intelligent weapon system is being developed
in the U.S. under the Assault Breaker Program.” It works as
follows. A missile dispenses "Skeet" delivery vehicles high above
an enemy tank concentration. Each delivery vehicle has a parachute
that opens at 700 feet to slow its descent. After the chutes
are released at 100 feet, the delivery vehicles dispense Skeet
submunitions. Each Skeet attempts to locate a tank and fire its
self-forging projectile at it.

4.  A  PROPOSED  TEST-BED  FACILITY 

The Army's Night Vision and Electro-Optical Laboratory is
providing an Attex all-terrain vehicle (figure 4-1) to the Autonomous
Vehicle Navigation Program. This vehicle has been modified
by the Systems Integration Division of NV&EOL so that
controls can be operated electronically. Westinghouse will add
onboard computers, vision system, and controls to convert the
vehicle to autonomous operation.

The computer equipment for this project will be configured as
two half racks. The two halves will be joined together during
laboratory development. The lower half will be disconnected for
use on board the vehicle when needed. The upper rack will contain
the following equipment:


•  monitor 

•  interactive  terminal  and  keyboard 

•  Winchester disk drive

•  three  floppy  disk  drives. 

The  lower  rack  will  contain  the  following  equipment: 

•  Westinghouse real-time gray level vision system

•  Intel chassis, with Intel single board computers (SBCs),
core and bubble memories, and interface boards.


(a) JPL Mars Rover Hardware Prototype


(b) JPL Mars Rover Software Prototype


(c) JPL Lunar Rover


Figure 3-5. Jet Propulsion Lab Rovers
The vehicle has a telescoping mast upon which will be placed a
Fairchild solid-state TV camera and pan-and-tilt mechanism.
Other sensors, such as laser range imager, FLIR, acoustic ranger,
and radar will be investigated for inclusion. The vehicle will also
contain more mundane equipment to make the system work,
such as power source and a remote control unit for emergencies
and test.


Figure 3-6. Quadractor


Figure 4-1. Attex All-Terrain Vehicle

The lower rack will be described in more detail. The Intel
chassis will initially include two Intel 8086/87 SBCs. Faster,
plug-compatible SBCs are being developed by Intel each year. They
could be substituted or added if required. One-half megabyte of
core memory will be present. Five 2-megabyte bubble memory
boards will be used for storing the data base. Bubble memory is
nonvolatile. Two megabyte boards should be available in 1984.

A Westinghouse AUTO-Q vision system will be used. It
performs one billion operations per second to process four million
pixels per second. It outputs data of three forms: edge vectors of
various lengths, blob data, and area histograms.
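
The two throughput figures quoted above imply a fixed processing budget per pixel. The short Python sketch below works out that budget; the 512 x 512 frame size used in the second function call is purely an assumption for illustration and is not a stated property of the AUTO-Q system.

```python
OPERATIONS_PER_SECOND = 1_000_000_000  # "one billion operations per second"
PIXELS_PER_SECOND = 4_000_000          # "four million pixels per second"


def operations_per_pixel():
    """Average number of operations available for each processed pixel."""
    return OPERATIONS_PER_SECOND / PIXELS_PER_SECOND


def frames_per_second(width, height):
    """Frame rate sustainable at the quoted pixel rate for a given frame size.

    The frame dimensions are supplied by the caller; they are not given in the text.
    """
    return PIXELS_PER_SECOND / (width * height)


budget = operations_per_pixel()          # operations per pixel
rate_512 = frames_per_second(512, 512)   # hypothetical 512 x 512 frames
```

At the quoted rates this comes to 250 operations per pixel, which gives a sense of how much low-level processing can be applied to each pixel in real time.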

Westinghouse is continuing to design new, more powerful
vision systems. One unit now in development will produce optical
flow vectors at real-time rates. It may be used later on in this
program.


5.  CONCLUSIONS 

Increased interest is being expressed in the development of
truly autonomous mobile robots. The advent of faster single-board
computers and denser memories has all but eliminated
constraints on processing speed and storage capacity. Dedicated
gray-level vision systems are now available to do low-level image
processing in real time. VHSIC hardware will soon be available.
Expert systems have been developed which could be adapted to
vehicle command and control.

If all this is true, then why aren't robot vehicles ready for
full-scale development and deployment? One answer is that the
middle-level vision problem has not yet been solved. No software
package exists for transforming low-level features into identified
objects with known geometric relationships. No one has come
even close to solving this problem. It remains to be seen what
progress will be made toward the solution in this decade.

Our plan is to construct a state-of-the-art test-bed facility for
algorithm evaluation. Since the Intel SBC appears to be emerging
as a standard for robot vehicles, it will be possible to evaluate
software developed elsewhere as well as software specifically
written for this project at the University of Maryland and
Westinghouse. Initial research will involve setting simple goals for
vehicle navigation within a confined outdoor environment and then
trying to achieve these goals. The best hope for progress lies in
the cooperation and exchange of results among the various
research groups working in this area.
Table A. Land Vehicles

NAME: Westinghouse Power Plant Vehicle
NUMBER BUILT: 1 built; 1 bought from PaR and modified
COST: PaR vehicle cost $30,000 to modify
SIZE, WEIGHT, PAYLOAD: 1 small; 1 large with 600 lb payload
STATUS: Working systems
ONBOARD COMPUTERS: None
POWER: Two 1/2 HP DC motors; AC via trailing cord
COMMENTS: Tracked. Umbilical cord. Solid state control system.
    AC/DC converter.

NAME: Rescue Vehicles for Nuclear Power Plants
ORGANIZATION: GCA/PaR Systems, Redwing, Minn.; Will Bochike,
    612 484 7261
NUMBER BUILT: 15
COST: $100,000 minimum
SIZE, WEIGHT, PAYLOAD: 500 lb payload; 1000 lb
STATUS: Built to order
ONBOARD COMPUTERS: None
POWER: Power via trailing cable
SENSORS, VISION SYSTEM: TV cameras
COMMENTS: Umbilical cord. For nuclear industry. Tracked vehicles.
    Mounted TV cameras.

NAME: Mining & Disarming Units
ORGANIZATION: MBA Associates, San Ramon, CA; Marketing Dept,
    415-837 7201
NUMBER BUILT: 20
COST: $150,000 and up
SIZE, WEIGHT, PAYLOAD: Various
STATUS: Built to order
ONBOARD COMPUTERS: None
SENSORS, VISION SYSTEM: TV cameras
COMMENTS: Steered via radio link. TV camera & video link. Tracked
    vehicles, some for AF ordnance retrieval.

NAME: Odex 1
ORGANIZATION: Odetics, Inc., 1380 S. Anaheim Blvd., Anaheim,
    CA 92805; Joe Slutzky, 714 774 5000
NUMBER BUILT: 1
COST: $200,000
SIZE, WEIGHT, PAYLOAD: 36 tall. 370 lb payload when stationary,
    1000 lb payload when walking
STATUS: Prototype built
POWER: 24V aircraft battery, 360 W-H
SENSORS, VISION SYSTEM: TV cameras
COMMENTS: 6 legs, 1 micro for each leg. Remote control, r-f link,
    joystick. Outdoor vehicle.

NAME: RPI Mars Rover
ORGANIZATION: Rensselaer Poly Inst, ECSE Dept, Troy, NY 12181;
    David Gisser, 518 270 6485
NUMBER BUILT: 1
SIZE, WEIGHT, PAYLOAD: Size of subcompact car, 600 lbs
STATUS: Not completely operational
ONBOARD COMPUTERS: Limited onboard processing plus high frequency
    data link to mainframe computer
POWER: 4 6 12V batteries
SENSORS, VISION SYSTEM: Triangulating system using solid state
    pulsed laser, rotating mirror, rotating mast
COMMENTS: Current effort emphasizes vision.

NAME: Emergency Security Robot
ORGANIZATION: World of Robots Co., 2335 E. High St., Jackson,
    Mich. 49203; 800-248 0896
NUMBER BUILT: 1
COST: Undetermined
STATUS: In research and development
ONBOARD COMPUTERS: Yes
POWER: Operates for 12 hrs before batteries need recharging
SENSORS, VISION SYSTEM: 5 sensors: IR, TV, microwave, audio &
    ammonia. Visual change detection
COMMENTS: 4 lg. wheels. Data link.
Table A. Land Vehicles (Continued)


Heathkit 
Hero  t  & 
Syslem 
Software 


Sumitomo 

Android 


$3500  tor 
human 
controlled 
vehicle 
$25  000 
lor  elec¬ 
tronic 
steering 
&  braking 


Ohio  State  Umv 
Dept  ol  Eiect 
Eng  .2015  Neil 
Ave  ColumDus 
OH  43210  R  B 
McGhee 
614  422-2820 


Battelie  Labs 
505  King  Ave 
Columbus.  OH 
43201  B  Brown- 
stem  J  Reidy 


TRW  One  Space 
Park.  MS  021769 
Redondo  Beach 
CA  90278  Mark 
Thomsen.  M 
Sherbrmgs 
213-535-1706 


Heathkit 
616-982-341 1 


Sumitomo  Elec¬ 
tric  Ind  Osaka 
Japan 


Carnegie  Mellon 
Umv  Pittsburgh  P t 
15213  Rat  Reddy 
Hans  Mora-  ; 
412-578-3829 


1  lab  mod 
el  built. 

1  being 
built  by 
1984 


$50  000 
lor  parts 
S 250  000 
tor 

personnel 
S 300  000 
lor  systems 


Sherry  Products 
Inc  .  t501  Raci- 
lie  Ceast  Hwy 
Hermosa  Beach 
CA  90254 
Steve  Sherry 
213  379-8457 


5  1/2  It 
tall.  6  tl 
wide.  15  it 
long  15000 
lbs  several 
thousand 
pound 
payload 


64”  tall 
G3"  wide 
86"  long 
900  lbs 


Current 
hexapod  is 
small  New 
one  will 
weigh  9000 
lbs 


Beginning 
3  •  4  yr 
program 


ONBOARO 


20  SBCs 
Intel  8088 
etc  array 
processor 
Loosely 
coupled 
network 
PL/M  Pascal 


4  cycle  8 
hp  Briggs 
&  Stratton 
gas  engine 


New  unit  New  unit 

will  be  12  8086 

completed  SBCs 

m  1984  •  1986  Pascal 


Based  upon 
Heathkit 

Hero-1 

Yes  see 
Hero- 1 
entry 

39  lbs 

For  sale 

Motorola 

6808 

micropro¬ 

cessor 

36"  high 

20"  wide 

39"  deep 

Prototype 

built 

Yes 

60  cm  dia- 

Mechanic- 

6  Motorola 

meler  100 

Jhy  com- 

68090  SBCs. 

cm  tall 

pieie 

10  MC  6805 

100  kg 

processors 

running 

low-ievei 

software 

partially 

tested,  vision  & 

high  level  control 

ellods  proceedng 

in  prahei  witti 

hateware 

cc.nuieiion 

lor  servo 

control 

language 

'C  ' 

28”  wide 

39"  long 

500  lb 
payload 

For  sale 

None 

4  Rechar¬ 
geable 
batteries 


Batteries 
&  DC-10- 0( 
converter 
600  WH 
1200  WHc 
24  00  WH 


SENSORS 

VISION 

SYSTEM 

COMMENTS 

Acoustic 
sensors  laser 
range  data 
Vision  syslem 
planned 

Ma|Or  ellort  Point 
to  point  autonomous 
travel  planned 

None 

4  tractor  wheels. 

4  wheel  sieermg 

4  wheel  drive  I'exibie 
frame,  powered  sus¬ 
pension  Wheels  are 
on  leg  assemblies 
(see  figure  3-6) 

Current  uni! 
has  2  GE  CIO 
cameras  Range 
sensors  II 
advanced 
vision  system 
is  added  it 
will  be  in  1986 

Current  unit  is  an 
indoor  vehicle  New 
unit  will  bean  out 
door  vehicle  Both 
have  6  legs 

Sensors  being 
investigated 
by  Brian 

Ke'ly  ot 

Pi'telle 

Guidance  and  taviga- 
lion  algorithms 
under  study  at 

Battelie 

6  Polaroid 

acoustic 

range-lmders 

Emphasis  is  on  soil- 
ware  lo  build  a 
world  model 

Sound  and 
light  sensor. 

3  wheels  Gripper 
Voice  synthesizer 

A  leaching  toot 

2  liber 
cptic  array: 

Gripper  The  objec¬ 
tive  is  elertromc 
parts  assembly 

TV  camera 
with  pan  and 
tit'  mechan¬ 
ism  Will  be 
vision  con¬ 
trolled 

1 

3  degree-nl-ireedom 
ground  plane  mobil¬ 
ity  3  sm  wheel  ass 
emblies.  each  able 
to  steer  and  drive 

Oata  link  to  VAX 
11/780 

None 

From  wheel  steering 
controlled  by  magne 
lie  pots  Sturdy  out 
door  vehicle  can 
climb  hills  has  11" 
tires  Joystick  con 
trolled.  Can  be 
easily  robotized 

Table A. Land Vehicles (Continued)

NAME: RB5X
ORGANIZATION: R B Robot Corp, Suite 201, 14618 W 6th Ave,
    Golden, CO 80401
COST: $1495
SIZE, WEIGHT, PAYLOAD: 13" dia, 24" tall, 10 lbs
STATUS: For sale
ONBOARD COMPUTERS: Microprocessor with 16 K bytes of core
POWER: 2 rechargeable 6V gelled electrolyte batteries
SENSORS, VISION SYSTEM: Voice recognition. Sonar sensor, bumper
    switch
COMMENTS: RS-232C port. A toy.

NAME: Attex All-Terrain Vehicle called Tomahawk
ORGANIZATION: Vehicle built by Attex Int Inc, 6168 Woodbine Ave,
    Ravenna, OH 44266, 216-2970077. Electronic controls added by
    NV&EOL
NUMBER BUILT: Many human controlled vehicles; 1 robot vehicle
SIZE, WEIGHT, PAYLOAD: 82" long, 56" wide, 42" high, 700 lbs,
    700 lb payload
STATUS: Electronically controllable vehicle exists at NV&EOL. Three
    year autonomous navigation program started
ONBOARD COMPUTERS: Intel 8086 SBCs, core and bubble memory to be
    added by Westinghouse
POWER: 16 HP four cycle gas engine. Batteries
SENSORS, VISION SYSTEM: Sensors and vision system to be added by
    Westinghouse
COMMENTS: Vehicle has 6 large wheels, telescoping mast. Test bed
    facility. U. Maryland is prime contractor.

NAME: BOB
ORGANIZATION: Androbots Inc, 1287 Lawrence Station Rd, Sunnyvale,
    CA 94036; Frank Jones
COST: $2500
SIZE, WEIGHT, PAYLOAD: 3 ft tall
STATUS: For sale
ONBOARD COMPUTERS: 3 microprocessors, up to 3 megabytes of onboard
    memory
COMMENTS: Toy

NAME: Hughes Testbed
ORGANIZATION: Hughes Research Labs, Malibu, CA 90265; David Tseng,
    Bruce Bullock
NUMBER BUILT: Several small testbeds
COST: Low
SIZE, WEIGHT, PAYLOAD: 3 ft long, 1 1/2 ft wide
STATUS: Currently testbed for software shakedown. To transition to
    real vehicle in 1985-1986
ONBOARD COMPUTERS: Telemetry link to Lisp processor and vision
    system
SENSORS, VISION SYSTEM: Vidicon camera (also stored maps). Polaroid
    sonar sensors. Multiprocessor vision system
COMMENTS: AI algorithm development. Situation assessment, fusion of
    sensor data, script and plan fixing, knowledge acquisition &
    representation, goal driven message planning, DMA terrain &
    culture digital data.

NAME: Lunar & Mars Rovers
ORGANIZATION: Cal Inst Tech, Jet Propulsion Lab, 4800 Oak Grove Dr,
    Pasadena, CA 91109; Carl Ruoff, 213-354-6101, 4864; Ken Holmes
NUMBER BUILT: 3
COST: High
SIZE, WEIGHT, PAYLOAD:
    Lunar Rover: 6 ft long, 18" wide, 16" high, 120 lbs
    Mars Rover Software Prototype: 4 ft wide, 7 ft long, 6 ft high,
        300 lbs
    Mars Rover Hardware Prototype: 6 ft long, 4 ft wide, 3 ft high,
        300 lbs
STATUS: Lunar and Mars rover missions postponed. Work commencing on
    autonomous military vehicles (funded by Army)
ONBOARD COMPUTERS: No rover has onboard computers since program did
    not advance to flight hardware stage. Onboard electronics does
    exist
POWER:
    Lunar Rover: auto batteries
    Mars Rover SP: rectified 120V
    Mars Rover HP: auto batteries
SENSORS, VISION SYSTEM:
    Lunar Rover: remote controlled TV camera
    Mars Rover SP: stereo computer vision with correlation tracking,
        computation done in lab computer
    Mars Rover HP: no vision system
COMMENTS:
    Lunar Rover: teleoperated, 3 articulated modules each with
        2 wheels
    Mars Rover SP: controlled via tetherline to lab computers,
        4 steerable auto wheels
    Mars Rover HP: remote control via radio link, tracks (loop
        wheels)

NAME: Robot M113 APC
ORGANIZATION: FMC Corp, 1105 Coleman Ave, San Jose, CA; Rosa Chang,
    408-289-2850
NUMBER BUILT: Many human controlled vehicles
COST: Expensive
SIZE, WEIGHT, PAYLOAD: Large military vehicle
STATUS: Ongoing research
ONBOARD COMPUTERS: Intel 8086
SENSORS, VISION SYSTEM: Vision system to be added in 1984
COMMENTS: Informal agreement with Hughes

NAME: Big Track Testbed
ORGANIZATION: Univ of Florida, 137 Larsen, Gainesville, FL 32611;
    Alexander Meystel, 904-392 4964
NUMBER BUILT: 1 tabletop unit
COST: Low
SIZE, WEIGHT, PAYLOAD: Very small
STATUS: Ongoing research
ONBOARD COMPUTERS: Sinclair home computer, PDP-11/34, Pascal
SENSORS, VISION SYSTEM: Polaroid range sensors
COMMENTS: Algorithm development: (1) Planner (2) Navigator (3) Pilot

Table B. Remotely Piloted Vehicles


NAME 

ORGANIZATION 

Aquila 

Lockheed  air 

Irame  Westing 
house  mission 
payload 

Mastilt 

Tadiran  Israel 

Elec  Industries 

1 1  Ben  Gurion 

Si  Fival-Shmuel 

PO  Box  648  Tel 

Aviv  61006 

Israel  03713111 

Scout 

Israel  Aircratt 

Ind  Ben  Gurion 
inti  Airport. 

Israel 

2!2  620-4400 

NUMBER 

BUILT 


SIZE 

WEIGHT 

PAYLOAO 


6'  10". 

12'  9" 
wingspan. 
220  lbs 
60  lb  pay¬ 
load  +  25 
lbs  tn 
parachute 
bay 


10'  9"; 
14'  2" 
wingspan. 
253  lbs 
66  lb 
payload 


STATUS 


In  production 


ONBOARO 

COMPUTERSIPROPULSION 


Navigation 
computer  4 
inertial 
system 


Nat  .atioii. 
ground  con¬ 
trol  auto 
pilot 


Autotrack 
GCS  manual 
autopilot 
preprogram¬ 
med. 


Rear  moun¬ 
ted  26  hp 
engine 
with  push 
propeller 


Mid  moun¬ 
ted  22  hp 
engine 
with  push 
propeller 


Mid  moun¬ 
ted  18  hp 
engine 
with  push 
propeller 


SENSORS 


Stabilized  TV 
camera 
3  tields-ol- 
view  auto¬ 
tracker  laser 
rangelmder 
laser  designa 
lor  FUR  to 
be  added 


Stabilized 
gimballed  TV 
camera  panor 
amic  camera 
lor  stilt 
photography 


Gyrostabilized 
gimballed  TV 
camera,  with 
t  15  zoom  and 
a  2  tields-ol- 
view  panoram¬ 
ic  camera  lor 
still  photog 
raphy 


COMMENTS 


Anti-jam  spread 
spectrum  data  link 
very  small  radar 
profile 


Table C. Tetherless Unmanned Vehicles


NAME 

ORGANIZATION 

Deep  Mobile 
Target 

EMI  Lid 

United  Kingdom 

EAVE  East 

Umv  ol  New  Hampshire 
Marine  Program 

Marine  Program  8idg 

Our  ham.  NH  03824 

Oick  Blidberg 

603  862  1234 

EAVE  West 

Naval  Oceans 

Systems  Center 

San  Oiego  CA  92152 
Bou  Wernli 

Epaulard 

C  N  E  X  0  France 

Ocean  Space 
Robot 

Mitsui  Shipbuilding 
and  Engineering  Cn 
Tokyo  Japan 

PAP  !04 

Societe  104 

Meudan.  France 

Robot  II 

M  1  T  Oept  ol  Ocean 
Engineering 

Cambndge 

MA 

Sell 

Propelled 

Underwater 

Research 

Vehicle 

Applied  Pfysics  Lab 
Seattle  WA 

Unmanned 

Arne 

Research 

Submersible 

Applied  Physics  Lab 
Seattle  WA 

i 

UFSS 

Naval  Research  Lab  f 
Washington  DC 

LENGTH 

WEIGHT 

PAYLOAO 


MAXIMUM 

DEPTH 


SENSORS 


Sonar 


MAXIMUM 

OEPTH  SENSORS 


366  m 


Very  low  fre¬ 
quency  radio 
navigation 


COMMENTS 


For  ASW  and  Training 


Pipeline  Instrumentation  piattorm 
Wilt  contain  an  expert  system. 


6100  processor,  and  bubble 
memory 


Preprogrammed  Surveys  mining 
sites  Brings  stereo  pairs  10 
surlace 


Wire  guided  submersible  lor 
sea  bed  exploration  and  the 
identification  ol  underwater 
objects 


Long  range  autonomous 
vehicle 
Allen M. Waxman


Center  for  Automation  Research 
University  of  Maryland 
College  Park,  MD  20742 


ABSTRACT 


This study concerns a new formulation and
method for solution of the "image flow problem."
It is relevant to the maneuvering of a robotic
system through an environment containing other
moving objects and/or terrain. The two-dimensional
image  flow  is  generated  by  the  relative  rigid  body 
motion  of  a  smooth,  textured  object  along  the  line 
of  sight  to  a  monocular  camera.  By  analyzing  this 
evolving  image  sequence,  one  hopes  to  extract  the 
instantaneous  motion  (described  by  six  degrees  of 
freedom)  and  local  structure  (slopes  and  curva¬ 
tures)  of  the  object  along  the  line  of  sight.  The 
formulation  relates  a  new  local  representation  of 
an  image  flow  to  object  motion  and  structure  by 
twelve  nonlinear,  algebraic  equations.  The  rep¬ 
resentation  parameters,  termed  "observables",  are 
given  by  the  two  components  of  image  velocity, 
three  components  of  rate-of-strain,  spin,  and  six 
independent  image  gradients  of  rate-of-strain  and 
spin,  evaluated  at  the  point  on  the  line  of  sight. 
This  representation  is  motivated  by  the  deforma¬ 
tion  of  a  finite  element  of  flowing  continuum.  A 
method  for  solving  these  equations  was  devised  and 
successfully  implemented  on  a  VAX-750  computer.  A 
number  of  examples  were  explored  revealing  two 
classes  of  ambiguous  scenes  (i.e.,  nonunique  solu¬ 
tions  are  obtained).  A  sensitivity  analysis  was 
also  begun  in  order  to  estimate  noise  levels  in 
the  representation  parameters  which  still  yield 
acceptable  solutions.  Estimates  of  computing  time 
required  for  this  approach  to  image  flow  analysis 
indicate  that  real-time  implementation  is  not  out 
of  the  question. 


1.  INTRODUCTION 


This study pertains to the field of real-time
dynamic image processing with regard to the maneuvering
of a robotic system through an environment
containing other moving objects and/or terrain.
For a robot to accomplish this task, it may need
to determine the three-dimensional structure and
relative rigid body motions of these objects, and
it must extract this information from a two-dimensional,
evolving, monocular image field in
real time. That is, the motion in space of a rigid,
textured object creates an image flow at the camera.
The information in an image flow is contained not
in the images themselves, but rather in the
rate-of-change of the image. It is our aim to invert
the image flow along a line of sight and thereby
determine the motion and local structure of an
object under view.

Of course, the image flow problem has its
counterpart in the realm of visual perception by
man and animals; that is the optical flow of
Gibson (1966) (also see Marr 1982). However,
here we shall concentrate not on the biological
issues of perception, but rather on an appropriate
representation, mathematical formulation, and
solution of the image flow problem. By "appropriate
representation" we mean the set of observables
which describe a local image flow and which lead
to a useful formulation of the problem, admitting
a rapid and stable solution. But in addition, one
must be able to extract these observables from the
evolving image in an efficient manner. This last
point is often ignored, taking the instantaneous
velocity field over the entire image as given.
Thus, Prazdny (1980) tried to compute the relative
motion and depth map of a set of five points (moving
as a rigid body) directly from the two components
of image velocity associated with each point
(assumed to be given). As expected, this method
failed when the points were close to each other,
for then the image velocities of the different
points were all very similar, and computational
difficulties associated with round-off errors
arose. Therefore, this method could not be used
to discern local object structure. Moreover, the
availability of accurate velocity measurements for
many points in an image is not something one can
take for granted. Prazdny's study does, however,
point to the importance of image velocity gradients.


Our choice of the "appropriate observables"
is motivated by the analogy of a local image flow
with the deformation of a finite element of flowing
fluid. As is well known in continuum mechanics,
the deformation of an infinitesimal element of
flowing continuum may be specified by the
velocity-gradient tensor, the symmetric and antisymmetric
parts of which have clear geometric interpretations
and are termed the "rate-of-strain" and "spin"
tensors, respectively. This description is manifest
in the Cauchy-Stokes Decomposition Theorem (Aris
1962), and may be applied to an infinitesimal area
in an image flow as long as the local object
structure is sufficiently smooth. Such considerations
were first applied to image flows by
Koenderink and van Doorn (1975, 1976) who, however,
made no attempt to implement them in a computational
scheme. Moreover, the image velocity-gradient
tensor does not reflect the curvature of
an object along the line of sight, nor does it
provide a sufficient number of relationships to
determine all the unknowns (eight in this case).
In order to obtain local object curvature as well
as slope, and the relative rigid body motion (which
constitutes a total of eleven unknowns), one must
consider the six independent gradients of the
rate-of-strain and spin as well. That is, the
second derivatives of the image velocity field are
also necessary. This would provide us with a
total of twelve nonlinear, algebraic relations
between the six parameters of motion, five structure
parameters and the twelve observables, which
in principle can be solved. Similar ideas have
appeared in a paper by Longuet-Higgins and
Prazdny (1980), but they too made no attempt to
implement them in order to test the feasibility
of their approach. They did, however, outline a
method of solution which hinges on the existence
of the focus of expansion. Unfortunately, their
method does not work for planar surfaces, and in
addition it is inapplicable to any object not
possessing a relative velocity of approach to (or
recession from) the observer, for then a unique
vanishing point does not exist. The fact that certain
aspects of object curvature are manifest in
the second derivatives of image velocity had already
been noted by Koenderink and van Doorn (1975),
relating the sign of the Gaussian curvature to the
gradients in the invariants associated with the
rate-of-strain and spin tensors.
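
The split of the image velocity-gradient tensor into its symmetric and antisymmetric parts can be sketched in a few lines of Python. This is only an illustration of the decomposition named above, not the solution method of the paper; the function name, the row/column layout of the gradient, and the sample numbers are assumptions made for the example.

```python
def decompose_velocity_gradient(grad):
    """Split a 2x2 image velocity-gradient tensor into rate-of-strain and spin.

    `grad` is assumed to be laid out as
        [[dvx/dx, dvx/dy],
         [dvy/dx, dvy/dy]].
    The symmetric part (rate-of-strain) has three independent components and
    the antisymmetric part (spin) has one, giving four first-order observables.
    """
    (a, b), (c, d) = grad
    shear = 0.5 * (b + c)
    rotation = 0.5 * (c - b)  # sign convention chosen for this sketch
    strain = [[a, shear], [shear, d]]
    spin = [[0.0, -rotation], [rotation, 0.0]]
    return strain, spin


# Hypothetical gradient values measured at the image origin.
gradient = [[0.2, 0.5], [-0.1, 0.3]]
strain, spin = decompose_velocity_gradient(gradient)

# The two parts always add back up to the original tensor.
rebuilt = [[strain[i][j] + spin[i][j] for j in range(2)] for i in range(2)]
```

Together with the two components of image velocity, these four quantities account for six of the twelve observables; the remaining six are the independent gradients of the rate-of-strain and spin.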

The  fact  that  our  formulation  incorporates  the 
gradients  of  the  rate-of-strain  and  spin  tensors 
implies  that  we  are  indeed  analyzing  the  image 
flow  in  a  finite  neighborhood  of  the  line  of  sight. 
Just  as  the  Cauchy-Stokes  decomposition  describes 
the  deformation  of  an  infinitesimal  area  in  the 
image,  the  gradients  of  this  decomposition  extend 
the  kinematic  analysis  to  a  small  but  finite  area 
of  image.  Moreover,  the  potential  for  a  geometric 
interpretation  of  these  higher  derivatives  remains, 
and this is crucial for extracting the twelve observables
from an evolving image element. It is
our  expectation  that  the  observables  we  utilize 
here  manifest  themselves  in  the  rate-of-deformation 
of  the  so-called  "zero-crossing  curves"  of  the  image 
intensity  distribution  (Marr  1982).  Preliminary 
work  on  idealized  curves  shows  this  to  be  the  case. 

The  approach  we  have  developed,  like  those 
described  above,  is  applicable  only  to  smooth 
structures,  i.e.,  finite  slopes  and  curvatures,  and 
so  cannot  be  used  directly  near  a  boundary  or  cusp 
on  an  object.  (A  multi-resolution  implementation 
on  smoothed  images  should  obviate  most  of  these 
limitations.)  Moreover,  the  whole  philosophy  of 
image  flows  is  applicable  only  to  objects  with 
texture or features, for otherwise no image deformation
would be observed except near an object's
boundaries.


In  Section  2,  we  present  a  systematic  deriva¬ 
tion  of  the  twelve  kinematic  equations  relating 
object  structure  and  motion  to  our  representation 
of  the  image  flow.  The  method  of  solution  is  de¬ 
veloped  in  (Waxman,  1983)  where  it  is  found  that 
the solutions divide into two families, one of
which is shown to be non-unique, possessing a two-fold
ambiguity. Our method of solving the twelve
nonlinear algebraic equations exploits a transformation
of the image screen coordinates which
aligns one image axis with the direction of zero
slope on the object at the point of observation.
The required transformation angle is itself a part
of the complete solution to the image flow equations.
(Waxman, 1983) also presents an alternative
(and  more  rapid)  method  for  obtaining  this  trans¬ 
formation  angle  from  an  additional  piece  of  data 
in  the  form  of  a  "radial  collision  time."  This 
"collision  time"  represents  the  distance  to  the 
object  along  the  line  of  sight  divided  by  the 
relative  speed  of  approach;  one  could  imagine  ob¬ 
taining  it  from  a  laser  Doppler  shift  and  ranging 
apparatus.  A  number  of  example  calculations  de¬ 
monstrating  the  feasibility  and  stability  of  the 
method  are  considered  in  (Waxman,  1983)  where  it 
is  noted  that  planar  surfaces  in  motion  also  re¬ 
veal  a  two-fold  ambiguity. 
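
In symbols, the radial collision time described above can be written as a simple ratio. The labels used here are assumptions made for illustration: $Z_0$ stands for the distance to the object along the line of sight and $V_Z$ for the relative speed of approach along it.

```latex
\tau = \frac{Z_0}{V_Z}
```

For example, an object 100 m away that is approached at 20 m/s would have a collision time of 5 s.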

2.  DERIVATION  OF  THE  "KINEMATIC  RELATIONS" 

The  motion  of  a  rigid  body  in  space  may  be 
uniquely  specified  (in  some  inertial  reference 
frame)  by  assigning  three  independent  components 
of  translational  velocity  and  three  components  of 
rotational  velocity  to  any  point  within  or  on  the 
bounding  surface  of  the  object.  We  shall  adopt  the 
point  on  the  surface  of  the  object  under  view 
which  intersects  the  line  of  sight  from  the  obser¬ 
ver.  As  is  usually  done  in  image  flow  studies,  the 
object  will  be  treated  as  stationary  with  the 
relative  motion  through  space  ascribed  to  the 
observer.  The  kinematic  equations  to  be  derived 
here  relate  this  rigid  body  motion  and  the  struc¬ 
ture  of  the  object's  bounding  surface  in  the  neigh¬ 
borhood  of  the  line  of  sight  to  our  representation 
of  the  local  image  flow. 

Following Longuet-Higgins and Prazdny (1980),
we adopt the coordinate systems shown in Figure 1.
The vertex of perspective projection is located at
the origin of a spatial coordinate system (X,Y,Z)
whose Z-axis is oriented along the instantaneous
line of sight directed at the object. This moving
coordinate system has three degrees of translational
freedom $(V_X, V_Y, V_Z)$ and three degrees of rotational
freedom $(\Omega_X, \Omega_Y, \Omega_Z)$. The two-dimensional image to
be analyzed is created by the perspective projection
of the object and environment onto a planar screen,
oriented normal to the Z-axis and intersecting it
at Z=1. The origin of the image coordinate system
(x,y) on the screen is located in space at (X,Y,Z)=
(0,0,1). Thus, a point P in space, located by
position vector R, projects onto the screen as point
p as shown in Figure 1.

Due to the motion of the observer, the relative
motion of point P in space is $-(\mathbf{V} + \boldsymbol{\Omega} \times \mathbf{R})$. In
component form we express P's motion through space
as

$$\dot{X} = -V_X - \Omega_Y Z + \Omega_Z Y, \tag{1a}$$

$$\dot{Y} = -V_Y - \Omega_Z X + \Omega_X Z, \tag{1b}$$

$$\dot{Z} = -V_Z - \Omega_X Y + \Omega_Y X. \tag{1c}$$

$$Z = Z_0 + \left(\frac{\partial Z}{\partial x}\right)_0 x + \left(\frac{\partial Z}{\partial y}\right)_0 y + \cdots, \tag{4b}$$

as follows. By replacing X and Y in series (4)
according to equations (2) and substituting the
whole of (4a) back into the right-hand side of
(4a) wherever Z appears, we find, after collecting
terms and comparing with (4b), that

The  projected  coordinates  of  P  on  the  screen  (loca¬ 
ted  at  Z=l)  are  given  by 

x  ■  X/Z  and  y  «  Y/Z  £,) 

The  corresponding  velocities  of  the  image  point  P 
are  simply  (vx,vy)=(x,y) ,  obtained  by  differenti¬ 
ating  expressions  (2)  with  respect  to  time  and 
utilizing  relations  (1).  Hence,  the  image  flow 
field  is  given  by 


'*)• 

£xyfix  -  (1  +  x1)  dY  +  Ynzj  • 

(3a) 

*}* 

[(1  +  y1)  «x  ■  xYny  '  x(!z]  ' 

(3b) 

We  can  already  see  from  equations  (3)  that  the 
image  flow  reflects  neither  the  absolute  distance 
to  points  on  an  object,  nor  the  absolute  trans¬ 
lational  velocities  through  space;  both  will  be 
scaled  by  the  distance  to  the  object,  say,  along 
the  line  of  sight. 

In  order  to  discern  the  local  structure  of  the 
object  in  the  neighborhood  of  the  line  of  sight, 
we  shall  perforft,  a  kinematic  analysis  of  the  image 
flow  in  the  vicinity  of  the  image  origin  (x,y)=(0,0). 
But  in  doing  so,  we  shall  need  to  take  various  deri¬ 
vatives  of  the  flow  (3)  with  respect  to  image  coor¬ 
dinates,  and  this  will  introduce  the  slopes  and 
curvatures  of  the  object  surface  on  the  line  of 
sight.  Thus,  to  the  desired  degree  of  resolution, 
we  must  be  able  to  describe  the  neighborhood  of  the 
surface  around  theiineof  sight  as  "smooth,"  i.e., 
twice  differentiable.  Given  the  locally  unique  sur¬ 
face  Z=^(X,Y) ,  we  can  describe  this  neighborhood 
by  a  Taylor  series  about  the  Z-axis: 


Z  -  Z 


o 


(4a) 


We  can  express  this  in  terms  of  image  coordinates 
by  recalling  (2)  that  X=xZ  and  Y=yZ,  hence  the 
implicit  equation  Z=£(xZ,yZ) .  This  can  be  con¬ 
verted  locally  to  an  explicit  surface  relation  Z= 
Z(x,y)  possessing  its  own  Taylor  series, 


(5a) 

(5b) 

(5c) 

(5d) 

(5e) 


Equations  (5)  relate  the  dimensionless  slopes  and 
curvatures  (in  units  of  Z0)  of  the  surface  descrip¬ 
tion  in  space  to  those  in  the  image  coordinate 
system.  All  quantities  in  (5)  are  evaluated  along 
the  line  of  sight  to  the  object,  hence  the  sub¬ 
script  zero. 
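
The relation between the two surface descriptions can be checked
numerically. The following Python sketch is not part of the original
paper. It assumes a quadratic patch
$Z = Z_0 + pX + qY + \frac{1}{2}rX^2 + \frac{1}{2}tY^2 + sXY$, where the
names `p, q, r, s, t` are hypothetical shorthand for the slopes and
curvatures in space. It solves the implicit equation Z = zeta(xZ, yZ) by
fixed-point iteration and compares finite-difference derivatives of
Z(x,y) at the image origin with the values produced by the substitution
procedure described in the text, namely $Z_x/Z_0 = p$,
$Z_{xx}/Z_0 = Z_0 r + 2p^2$ and $Z_{xy}/Z_0 = Z_0 s + 2pq$.

```python
def depth(x, y, Z0, p, q, r, s, t, iters=100):
    """Depth Z(x, y) along the ray through image point (x, y).

    Solves Z = Z0 + p*X + q*Y + r*X^2/2 + t*Y^2/2 + s*X*Y with X = x*Z,
    Y = y*Z by fixed-point iteration (converges for small |x|, |y|).
    """
    Z = Z0
    for _ in range(iters):
        X, Y = x * Z, y * Z
        Z = Z0 + p * X + q * Y + 0.5 * r * X * X + 0.5 * t * Y * Y + s * X * Y
    return Z


def image_derivatives(Z0, p, q, r, s, t, h=1e-3):
    """Central-difference derivatives of Z(x, y) at the image origin."""
    def f(x, y):
        return depth(x, y, Z0, p, q, r, s, t)

    Zx = (f(h, 0) - f(-h, 0)) / (2 * h)
    Zy = (f(0, h) - f(0, -h)) / (2 * h)
    Zxx = (f(h, 0) - 2 * f(0, 0) + f(-h, 0)) / h**2
    Zyy = (f(0, h) - 2 * f(0, 0) + f(0, -h)) / h**2
    Zxy = (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4 * h**2)
    return Zx, Zy, Zxx, Zyy, Zxy


def predicted_derivatives(Z0, p, q, r, s, t):
    """Image-coordinate derivatives predicted from the spatial ones."""
    return (
        Z0 * p,
        Z0 * q,
        Z0 * (Z0 * r + 2 * p * p),
        Z0 * (Z0 * t + 2 * q * q),
        Z0 * (Z0 * s + 2 * p * q),
    )


if __name__ == "__main__":
    params = (5.0, 0.3, -0.2, 0.1, 0.05, -0.08)  # Z0, p, q, r, s, t (arbitrary)
    for num, pred in zip(image_derivatives(*params), predicted_derivatives(*params)):
        print(f"numeric {num:.6f}   predicted {pred:.6f}")
```

The quadratic terms in the image-coordinate curvatures show why slope
and curvature become coupled once the surface is described in image
coordinates rather than in space.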

Guided by the Cauchy-Stokes Decomposition
Theorem (Aris 1962), we proceed to study the deformation
of the image flow (3) by forming the image
velocity-gradient tensor, and then decomposing it
into a sum of symmetric and antisymmetric parts:

$$\frac{\partial v_i}{\partial \xi_j} = e_{ij} + \omega_{ij}. \qquad (6)$$

Here, the subscripts i and j can each take the
values 1 and 2, with $(v_1, v_2) = (v_x, v_y)$ and $(\xi_1, \xi_2) =$
(x,y). The individual elements of the rate-of-strain
tensor $e_{ij}$, and the spin tensor $\omega_{ij}$, have
geometrical significance in describing the deformation
of a differential neighborhood of any image
point (x,y), though here we focus on the image
origin. The symmetric rate-of-strain tensor $e_{ij}$
has three independent elements: $e_{xx}$ = rate-of-stretch
of a differential image line oriented along the x-axis,
$e_{yy}$ = rate-of-stretch along the y-axis, $e_{xy} = e_{yx}$ =
one-half the rate-of-decrease of the angle between
two differential segments along the image axes.

The antisymmetric spin tensor has only one independent
element, $\omega$ = rate-of-rotation of the differential
neighborhood of image about the origin. An
alternative insight may be gained by considering
the eigenvalues and eigenvectors of the rate-of-strain
tensor (Aris 1962; Koenderink and van Doorn 1975,
1976); whence, one finds that a circular neighborhood
of the point (x,y) dilates according to the
sum of the eigenvalues (equivalent to the trace of
$e_{ij}$), undergoes a stretch and compression at constant
area according to the difference of the
eigenvalues (along mutually perpendicular axes
aligned with the eigenvectors of $e_{ij}$) thereby deforming
into an ellipse, and then rotates as a
locally rigid area element according to the spin $\omega$.
This deformation is superposed on a uniform translation
of the infinitesimal neighborhood along
with the point (x,y) moving with velocity $(v_x, v_y)$.
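From equations (3) we obtain the expressions

$$e_{xx} = \left\{ \frac{V_Z}{Z} + \frac{1}{Z}\frac{\partial Z}{\partial x}\left( \frac{V_X}{Z} - x\frac{V_Z}{Z} \right) \right\} + \left[ y\,\Omega_X - 2x\,\Omega_Y \right], \qquad (7a)$$

$$e_{yy} = \left\{ \frac{V_Z}{Z} + \frac{1}{Z}\frac{\partial Z}{\partial y}\left( \frac{V_Y}{Z} - y\frac{V_Z}{Z} \right) \right\} + \left[ 2y\,\Omega_X - x\,\Omega_Y \right], \qquad (7b)$$

$$e_{xy} = \left\{ \frac{1}{2Z}\frac{\partial Z}{\partial x}\left( \frac{V_Y}{Z} - y\frac{V_Z}{Z} \right) + \frac{1}{2Z}\frac{\partial Z}{\partial y}\left( \frac{V_X}{Z} - x\frac{V_Z}{Z} \right) \right\} + \frac{1}{2}\left[ x\,\Omega_X - y\,\Omega_Y \right], \qquad (7c)$$

$$\omega = \left\{ \frac{1}{2Z}\frac{\partial Z}{\partial x}\left( \frac{V_Y}{Z} - y\frac{V_Z}{Z} \right) - \frac{1}{2Z}\frac{\partial Z}{\partial y}\left( \frac{V_X}{Z} - x\frac{V_Z}{Z} \right) \right\} - \frac{1}{2}\left[ x\,\Omega_X + y\,\Omega_Y + 2\,\Omega_Z \right]. \qquad (7d)$$
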

Relations  (3)  and  (7),  evaluated  at  (x,y)= 
(0,0),  describe  the  rate  of  translation,  rotation, 
and  deformation  of  an  infinitesimal  neighborhood 
of  the  image  element  along  the  line  of  sight. 

Taken  together,  they  constitute  six  independent  re¬ 
lations  among  eight  unknowns  (three  translation 
rates  scaled  by  Z0,  three  rotation  rates,  and  the 
two surface slopes). Clearly, we do not have as
yet  a  sufficient  number  of  relations  to  solve 
for  the  number  of  unknowns  encountered  so  far. 
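
The six first-order relations can be illustrated with a short numerical
experiment. The Python sketch below is an illustration rather than the
authors' implementation. It assumes the image flow (3) in its usual
Longuet-Higgins and Prazdny form, a planar surface patch
Z = Z0 + pX + qY (so that Z(x,y) = Z0 / (1 - px - qy)), and the sign
convention spin = (dv_y/dx - dv_x/dy)/2. The function names and test
values are hypothetical.

```python
def image_flow(x, y, Z, V, W):
    """Image velocity (vx, vy) at image point (x, y) with depth Z.

    V = (VX, VY, VZ) is the observer's translation and W = (WX, WY, WZ)
    its rotation (assumed form of the flow equations (3)).
    """
    VX, VY, VZ = V
    WX, WY, WZ = W
    vx = (x * VZ - VX) / Z + (x * y * WX - (1 + x * x) * WY + y * WZ)
    vy = (y * VZ - VY) / Z + ((1 + y * y) * WX - x * y * WY - x * WZ)
    return vx, vy


def planar_depth(x, y, Z0, p, q):
    """Depth along the ray through (x, y) for the plane Z = Z0 + p*X + q*Y."""
    return Z0 / (1.0 - p * x - q * y)


def strain_and_spin(V, W, Z0, p, q, h=1e-5):
    """Velocity, rate-of-strain elements and spin at the image origin.

    The velocity gradient is estimated by central differences and then
    split into symmetric (strain) and antisymmetric (spin) parts.
    """
    def flow(x, y):
        return image_flow(x, y, planar_depth(x, y, Z0, p, q), V, W)

    vx0, vy0 = flow(0.0, 0.0)
    vx_px, vy_px = flow(h, 0.0)
    vx_mx, vy_mx = flow(-h, 0.0)
    vx_py, vy_py = flow(0.0, h)
    vx_my, vy_my = flow(0.0, -h)

    dvx_dx = (vx_px - vx_mx) / (2 * h)
    dvy_dx = (vy_px - vy_mx) / (2 * h)
    dvx_dy = (vx_py - vx_my) / (2 * h)
    dvy_dy = (vy_py - vy_my) / (2 * h)

    return {
        "vx": vx0,
        "vy": vy0,
        "exx": dvx_dx,
        "eyy": dvy_dy,
        "exy": 0.5 * (dvx_dy + dvy_dx),
        "spin": 0.5 * (dvy_dx - dvx_dy),
    }


if __name__ == "__main__":
    result = strain_and_spin(V=(1.0, 0.5, 2.0), W=(0.1, -0.2, 0.3),
                             Z0=10.0, p=0.4, q=-0.3)
    for name, value in result.items():
        print(f"{name:5s} {value: .6f}")
```

Only the ratios V/Z0, the rotation rates and the slopes p and q enter
the six printed quantities, which is why eight unknowns cannot be
recovered from them alone.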

This motivates our forming the six independent
gradients of the rates-of-strain and spin (or
equivalently, the independent second derivatives of
$v_x$ and $v_y$). These gradients represent comparisons
of the stretch and rotation rates at neighboring
points. But rather than analyzing the infinitesimal
neighborhoods of two individual points, these
gradients extend our kinematic analysis to a
finite neighborhood of a single image point. Utilizing
equations (7), we derive the following six
independent relations:

[Equations (8)]

Now if we evaluate equations (3), (7) and (8)
at (x,y)=(0,0), there result twelve independent
relations between the image flow representation
and the eleven parameters describing the rigid body
motion in space and object structure in a finite
neighborhood of the line of sight. In order to
simplify the notation throughout the remainder of
the paper, we define the twelve observables $O_i$
(i = 1,...,12) as the following kinematic quantities
evaluated on the line of sight (x,y)=(0,0):

[Equations (9)]

Just as the Cauchy-Stokes Decomposition Theorem
describes the rate-of-deformation of an infinitesimal
area in the image, our observables describe
the deformation of a finite area in the image.
We refer to them as observables for they constitute
our representation of the local image flow, and
must be extracted from the evolving image. In
future work we expect to relate these observables
to the rate-of-deformation of the "zero-crossing
curves" of an image which reflect the texture of
the object's surface (Marr 1982).

Next, define the six parameters of motion $M_j$
(j = 1,...,6) describing the observer's rigid body motion
through space:

[Equations (10)]

The parameters $M_1$ and $M_2$ give the angular velocities
of the object perceived by the observer due to
his translation through space. Parameter $M_3$ is the
inverse of a radial collision time between object
and observer due to the relative velocity of approach,
when taken alone. Parameters $M_4$, $M_5$, and
$M_6$ give the components of object spin perceived by
the observer due to his own spinning motion. Similarly,
we define five parameters of structure (or
topography) $T_k$ (k = 1,...,5), utilizing relations (5),
as follows:

[Equations (11)]

Parameters $T_1$ and $T_2$ describe the slope of the
object surface at the point on the line of sight,
and also yield the local elements of the surface
metric tensor. Parameters $T_3$, $T_4$, and $T_5$ yield
(along with $T_1$ and $T_2$) the three independent elements
of the (dimensionless) symmetric curvature
tensor describing the variation of surface slope in
the neighborhood of the line of sight (McConnell
1957). The slopes themselves are dimensionless;
the curvatures are scaled by the distance $Z_0$ to
the object along the line of sight.

Thus, evaluating equations (3), (7) and (8) at
(x,y)=(0,0) and incorporating the definitions (9),
(10) and (11), we obtain the following twelve kinematic
relations describing the image flow in a
finite neighborhood of the line of sight:

$$O_1 = -M_1 - M_5,$$

[Remaining equations (12)]

These image flow equations form a set of
twelve coupled, nonlinear, algebraic equations
among eleven unknowns. The fact that they are nonlinear
implies the possibility of multiple solutions,
corresponding to ambiguous scenes. In the
next section we shall solve these equations,
selecting eleven of them in order to recover possible
sets of $M_j$ and $T_k$ corresponding to given $O_i$,
and reserving the twelfth equation as a constraint
relation which any acceptable solution must satisfy.
As we shall see, two classes of ambiguous
scenes emerge, but one must keep in mind that these
ambiguities are local; a patching together of local
solutions to form global structure and motion
models may well break these ambiguities in most
cases.

Finally, note that the nonlinearities inherent
in the image flow equations (12) are all quadratic,
being formed by the product of a structure parameter
$T_k$ with one of the translational motion
parameters $M_i$ (i = 1,...,3). In fact, the slopes $T_1$ and
$T_2$ are always multiplied by $M_1$, $M_2$, or $M_3$; the
curvatures $T_3$, $T_4$, and $T_5$ always appear in products
with $M_1$ or $M_2$ alone. That is to say, surface slope
is revealed by translation through space, while
surface curvature is revealed by translation parallel
to the image plane. (Recall, however, that the
rigid body motion of interest was defined with respect
to the point on the object's surface intersected
by the line of sight, which is not the dynamical
center of mass of the object under view.)

3. SOLUTION BY TRANSFORMATION

By "solving the image flow equations" we mean,
given the twelve observables $O_i$, obtain all possible
sets $\{M_j, T_k\}$ of motion and structure parameters
that are consistent with relations (12). That is,
we wish to invert the local image flow. The difficulty
in solving equations (12) stems from the multitude
of quadratic nonlinearities formed by products
of motion and structure parameters. Moreover,
all these parameters may range between plus and
minus infinity, in principle. The method of solution
developed in (Waxman, 1983) hinges on a rotation of
the image coordinate system which aligns one image
axis with the direction of zero slope on the object
at the point of observation. The transformation
angle is itself an unknown which is to be solved
for, replacing one of the structure parameters.
However, this angle $\alpha$ is bounded, $-90° < \alpha \le +90°$,
and this fact makes all the difference!

Consider the differential of radial distance
between observer and object, in the neighborhood of
the line of sight. To first order in differential
quantities, this is equivalent to the differential
dZ evaluated at (x,y)=(0,0). We have

$$dZ_0 = \left(\frac{\partial Z}{\partial x}\right)_0 dx + \left(\frac{\partial Z}{\partial y}\right)_0 dy = Z_0\,(T_1\,dx + T_2\,dy); \qquad (13a)$$

hence

$$dZ_0 = 0 \quad \text{along} \quad \frac{dy}{dx} = -\frac{T_1}{T_2}. \qquad (13b)$$

Relation (13b) indicates a direction in the image
plane, passing through its origin, along which the
distance to the object is locally constant, i.e.,
it is the direction of vanishing slope. (This
direction corresponds to an equi-depth contour as
is found in Moiré photography (Meadows, et al.
1970; Takasaki 1970) of objects.) This suggests
constructing a new set of image coordinates $(\bar{x}, \bar{y})$
rotated by an angle $\alpha$ from the original image coordinates
(x,y), as shown in Figure 1. By aligning
the $\bar{x}$-axis with the direction of zero slope we
have, by definition, $\bar{T}_1 = 0$. According to (13b), the
transformation angle is uniquely specified by

$$\alpha = \tan^{-1}(-T_1/T_2) \quad \text{with} \quad -90° < \alpha \le +90°, \qquad (13c)$$

except at "specular points" where $T_1 = T_2 = 0$. Specular
points correspond to local maxima, minima,
and saddle points in the distance between observer
and object surface. At these points, the differential
in distance vanishes to first order in all
directions in the image plane; thus, any convenient
value may be chosen for $\alpha$ when the line of sight
intersects such a point. Of course the rotation
angle $\alpha$, as given by (13c), is not known ahead of
time since $T_1$ and $T_2$ are themselves unknowns to be
solved for. That is, the angle $\alpha$ replaces the
first  structure  parameter  (in  the  transformed 
system)  as  an  unknown. 
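
As a small worked illustration of the rotation, the Python sketch below
computes the angle of zero slope from a given pair of slopes and
confirms that the first slope component vanishes in the rotated frame.
It is a sketch under stated assumptions, not the procedure of (Waxman,
1983): the slopes are treated as known (in the actual problem they are
unknowns), the angle is taken as arctan(-T1/T2), and the slope pair is
assumed to transform as the components of a gradient under a rotation
of the image axes. The function names are hypothetical.

```python
import math


def zero_slope_angle(T1, T2):
    """Angle (radians) of the zero-slope direction, in (-pi/2, +pi/2]."""
    if T1 == 0.0 and T2 == 0.0:
        # "Specular point": every direction has zero slope, so any
        # convenient angle may be chosen; zero is picked arbitrarily.
        return 0.0
    if T2 == 0.0:
        return math.pi / 2.0
    return math.atan(-T1 / T2)


def rotate_slopes(T1, T2, alpha):
    """Slope components in image axes rotated by alpha.

    Assumes the slopes transform like a gradient vector.
    """
    c, s = math.cos(alpha), math.sin(alpha)
    return T1 * c + T2 * s, -T1 * s + T2 * c


if __name__ == "__main__":
    T1, T2 = 0.3, -0.4  # arbitrary example slopes
    alpha = zero_slope_angle(T1, T2)
    T1_bar, T2_bar = rotate_slopes(T1, T2, alpha)
    print(f"alpha = {math.degrees(alpha):.3f} deg")
    print(f"rotated slopes = ({T1_bar:.3e}, {T2_bar:.6f})")
```

The rotation leaves the magnitude of the slope unchanged, so the whole
slope is carried by the second component once the first has been
rotated away.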

Just as we can specify a rotation of the image
coordinates for any angle $\alpha$, we can construct corresponding
transformation relations for the observables,
the parameters of motion, and the structure
parameters, i.e. $\{O_i, M_j, T_k\} \to \{\bar{O}_i, \bar{M}_j, \bar{T}_k\}$. We
are, however, interested only in those $\alpha$ for which
$\bar{T}_1 = 0$. The required transformations are derived in
(Waxman, 1983). The transformed image flow equations,
relating $\bar{M}_j$ and $\bar{T}_k$ to $\bar{O}_i$, are simply equations
(12) without the terms involving $\bar{T}_1$. The
problem is then: given the $O_i$, find all solution
sets $(\alpha, \bar{M}_j, \bar{T}_k)$ of the transformed image flow equations
and then transform the motion and structure
parameters back to the original coordinates,
$\{\bar{M}_j, \bar{T}_k\} \to \{M_j, T_k\}$.

Solving equations (12) in the transformed system
is quite straightforward; and finding the angle
$\alpha$ is not difficult either. Solutions generally
divide themselves into two classes, one with unique
solutions when the surface possesses local curvature,
the other being inherently nonunique with a
two-fold ambiguity. These ambiguous solutions arise
when a particular coincidence between object motion
and local structure occurs. "Specular points" are
an important exception, giving rise to unique interpretations
despite their membership in the second
class of solutions. Planar surfaces in motion also
possess a two-fold ambiguity, except when there is
no relative approach velocity to the observer, in
which case the interpretation is unique. However,
it should be kept in mind that these multiple interpretations
are "local ambiguities," and that many
of them may be resolved in the process of building
global structure models by piecing together solutions
from neighboring regions. A sensitivity
analysis was also begun in order to ascertain the
effects of noise (or uncertainty) in the observables.
Preliminary results suggested that the
method is quite stable, though further studies remain
to be done in this area (see (Waxman, 1983)
for details).

4. CONCLUDING REMARKS

We have introduced a new representation of
a local image flow in terms of the image velocities,
strain rates, spin, and image gradients
of strain rate and spin, evaluated along the line
of sight to a moving surface. A set of twelve
kinematic relations (nonlinear algebraic equations)
was derived which relate these representation
parameters ("observables") to the local
surface slopes, curvatures, and parameters of rigid
body motion. A method to solve these equations
for the structure and motion parameters, given the
observables, has been developed and implemented on
a VAX-730 computer.

The next phase of this work will be directed
at extracting the "observables" from an evolving
image sequence. In this regard we hope to exploit
the neighborhood interpretation of the local image
flow representation adopted here. That is, our
"observables" actually describe the rate-of-deformation
of a small but finite neighborhood in
the image, around the line of sight. As the deformation
of a neighborhood can be ascertained by
studying the deformation of its bounding curve, we
expect to be able to obtain all of the observables
by following the deformation of closed "zero-crossing
curves" of the image intensity variation
map. We anticipate that this neighborhood approach
should also lead to a rather robust method
of obtaining the required observables, more so than
tracking a set of points through an evolving image
sequence.

1. Abstract

This  note  describes  ways  of  formally  expressing 
facts  about  images.  It  is  shown  that  some  of  the  known 
heuristics  for  deducing  facts  about  scenes  from  images 
can  actually  be  proved  correct. 


2.  Introduction 

Our intent is to provide a "mathematical" commentary
for some of the recent work on interpretations
of image data. In particular, we can show that a few
of the coincidence assumptions stated by Binford in [1]
can actually be proved in a suitable formal framework.

One should not expect formalisations of theories
to have tangible connections with successful implementations
of algorithms; Artificial Intelligence programs
need not be based on the paradigm of theorem proving.
However, the clarification of the formal concepts
underlying these systems can be of great importance
in terms of program architecture and further development.
Axiomatization of knowledge domains can therefore
be viewed as a pedagogical (or philosophical) exercise,
the basic intent being the elucidation of the fundamental
concepts involved and their relationships.

We will provide a formal framework for discussing
the representation of polyhedra by line drawings. It follows
from our analysis that many of the "impossible"
pictures of Huffman in [3] can be detected by simpler
means than the ones used by Huffman [3], Clowes [2],
Waltz [7] or Wesley and Markowsky [4]. Given that our
methods are simpler (even if not complete), they may
be closer to the process actually used by the human
visual system. They have the additional advantage of
being generalizable with suitable modifications to
semi-algebraic sets (i.e. objects defined through sets of algebraic
inequalities), though this will not be taken up
in this note. For the sake of simplicity, only objects
defined by linear varieties are discussed.


3.  Formal  Background 

We follow the standard conventions of geometry
and topology in our notation and terminology. For
background, one may consult Rourke and Sanderson [5]
or Mumford [6]; the techniques presented in this paper
(though mathematically rather trivial) have the flavor
of algebraic geometry.

The terminology of Wesley and Markowsky [4] is
used to describe the fundamental objects of study.

DEFINITION 1: A face $f$ is the closure of a nonempty,
bounded, connected, coplanar, open subset of
$\mathbb{R}^3$ whose boundary is the union of a finite number of
line segments.

DEFINITION 2: An object $O$ is the closure of a
nonempty, bounded, open subset of $\mathbb{R}^3$ whose boundary
is the union of a finite number of faces.

From now on, by a face of an object $O$ we mean a
maximal face contained in $\partial O$. It is easy to see that any
object is a compact polyhedron; in particular a finite
union of simplices.

DEFINITION 3: The set of vertices of $f$ is the set of
all points for which two noncollinear line segments contained
in the boundary of $f$ can be found whose intersection
is the given point.

DEFINITION 4: The set of edges of $f$ is the set of all line
segments $e$ contained in the boundary of $f$ such that all
the endpoints of $e$ are vertices and no interior point of
$e$ is a vertex.

The vision task may be modeled by a (non-trivial
linear) projection $\Pi : \mathbb{R}^3 \to \mathbb{R}^2$. We hypothesize an
object $O$ contained in the positive part of $\mathbb{R}^3$ in general
position with respect to $\Pi$; i.e., any linear variety
generated by the vertices of $O$ is in general position
with respect to $\Pi$. There is a natural ordering $\prec$ on
the fibers $\Pi^{-1}(\{x\})$. Since $O$ is compact, the set
$O \cap \Pi^{-1}(\{x\})$, if non-empty, has a $\prec$-least element, which
we denote by $S_x$.

DEFINITION 5: The set $S = \{S_x \mid x \in \Pi(O)\}$ is
the visible part of $O$ under $\Pi$.

It follows that $\bar{S}$ is a compact polyhedron and that
the map $\Pi : S \to \Pi(O)$ is bijective. In fact, one can
show that for any $x$

$$|\Pi^{-1}(\{x\}) \cap \bar{S}| < \infty.$$

ABSTRACT

This paper addresses the problems of (1) representing
natural shapes such as mountains, trees and clouds, and (2)
computing such a description from image data. In order to
solve these problems we must be able to relate natural surfaces
to their images; this requires a good model of natural surface
shapes. Fractal functions are a good choice for modeling natural
surfaces because (1) many physical processes produce a fractal
surface shape, (2) fractals are widely used as a graphics tool for
generating natural-looking shapes, and (3) a survey of natural
imagery has shown that the 3-D fractal surface model, transformed
by the image formation process, furnishes an accurate
description of both textured and shaded image regions. This
characterization of image regions has been shown to be stable
over transformations of scale and linear transforms of intensity.

Much work has been accomplished that is relevant to computing
3-D information from the image data, and the computation
of a 3-D fractal-based representation from actual image data
has been demonstrated using an image of a mountain. This example
shows the potential of a fractal-based representation for
efficiently computing good 3-D representations of natural shapes,
including such seemingly-difficult cases as mountains, clumps of
leaves and clouds.
1.  INTRODUCTION 

This paper addresses two related problems: (1) representing
natural shapes such as mountains, trees and clouds, and (2)
computing such a description from image data. The first step
towards solving these problems, it appears, is to obtain a model
of natural surface shapes. The task of finding such a model is
extremely important to computer vision because we face problems
that seem impossible to address with standard descriptive
techniques. How, for instance, should we describe the shape
of leaves on a tree? Or grass? Or clouds? When we attempt
to describe such common, natural shapes using standard
shape-primitive representations, the result is an unrealistically complicated
model of something that, viewed introspectively, seems
very simple.

The research reported herein was supported by the Defense
Advanced Research Projects Agency under Contract No. MDA
903-83-C-0027; this contract is monitored by the U. S. Army
Engineer Topographic Laboratory. Approved for public release,
distribution unlimited.


Figure 1. Fractal-based models of natural shapes, by Mandelbrot
and Voss [1].

Furthermore, how can we extract 3-D information from
the image of a textured surface when we have no models that
describe natural surfaces and how they evidence themselves in
the image? The lack of such a 3-D model has generally restricted
image texture descriptions to being ad hoc statistical measures
of the image intensity surface. A good model of natural surfaces
together with the physics of image formation would provide the
analytical tools necessary for relating natural surfaces to their
images. The ability to relate image to surface can provide the
necessary leverage for dealing appropriately with the problems of
finding a good representation for natural surfaces and computing
such a description from the image data.

Even shape-from-shading [22,23] and surface-interpolation
methods [21] are limited by the lack of a 3-D model of natural
surfaces. Currently all such methods employ the heuristic of
"smoothness" to relate neighboring points on the surface. Such
heuristics are applicable to many man-made surfaces, of course,
but are demonstrably untrue of most natural surfaces. In order
to apply such techniques to natural surfaces, therefore, we must
find a heuristic that is true of natural surfaces. Finding such a
heuristic requires recourse to a 3-D model of natural surfaces.

The image $\Pi(O)$ is again a union of faces. In fact,
for any $x \in \partial\Pi(O)$ there is an edge $e$ of $O$ such that

$$\Pi^{-1}(\{x\}) \cap e \cap \bar{S} \neq \emptyset.$$

DEFINITION 6: The picture $Pict(O, \Pi)$ associated
to $\Pi$, $O$ is the collection

$$\{\Pi(e \cap \bar{S}) \mid e \in E(O),\ e \cap \bar{S} \neq \emptyset\},$$

where $E(O)$ denotes the set of all edges of the object $O$.

It is easy to see that the image $\Pi(O)$ can be
generated from $Pict(O, \Pi)$.


4.  Facts  from  Images 

We shall enumerate several facts about interpretations
of pictures which are now provable in the formal
setup given above. They can all be found, in one
form or another, in Binford [1]. In the following,
$\bar{e} = \Pi(e \cap \bar{S})$ and $\bar{f} = \Pi(f \cap \bar{S})$ denote arbitrary elements
of $Pict(O, \Pi)$.

FACT 1: Any element $\bar{e} \in Pict(O, \Pi)$ is an image
of an edge of $O$, lying on two non-coplanar faces.

FACT 2: If $A \subset \Pi(O)$ is an open, connected set
which does not intersect any set from $Pict(O, \Pi)$, then
$\Pi^{-1}(A) \cap \bar{S}$ is a part of a face (in particular, contained
in a plane) and $\Pi : \Pi^{-1}(A) \cap \bar{S} \to A$ is bijective.

FACT 3: If $\bar{e}, \bar{f} \in Pict(O, \Pi)$ are parallel, then so
are $e$ and $f$.

FACT 4: If P,Q,R are distinct collinear points
that can be expressed as intersections of elements of
$Pict(O, \Pi)$, then there are collinear points p,q,r above
them in $\bar{S}$.

The following three facts give a complete analysis
of points of intersection in $Pict(O, \Pi)$. They depend
strongly on the linearity and general position assumptions.

FACT 5: Assume $\bar{e}, \bar{f} \in Pict(O, \Pi)$ form an L-vertex
at a point P; i.e. P is an endpoint for both of them,
and they are not collinear. Then $e$ and $f$ intersect at a
point above P.

FACT 6: Assume $\bar{e}, \bar{f} \in Pict(O, \Pi)$ form a T-junction
at a point P; i.e. P is an endpoint for $\bar{e}$ and
an interior point for $\bar{f}$. Then $e$ and $f$ do not intersect;
in fact $e$ contains a point above $f$ and P.

FACT 7: Assume $\bar{e}, \bar{f} \in Pict(O, \Pi)$ form an X-vertex
at a point P; i.e. P is an interior point for $\bar{e}$, $\bar{f}$. Then $e$
and $f$ intersect at a point above P.

These facts are quite sufficient to determine the impossibility
of, say, the Penrose triangle (see, e.g., Binford
[1], figure 6). In that case the image can be proved
to depict three planar regions bordered by three lines
which do not have a common point of intersection.

Fractal functions seem to provide such a model of natural
surface shapes. Fractals are a novel class of naturally-arising
functions discovered primarily by Benoit Mandelbrot.
Mandelbrot and others [1,2,4] have shown that fractals are
found widely in nature and that a number of basic physical
processes, such as erosion and aggregation, produce fractal surfaces.
Because fractals look natural to human beings, much
recent computer graphics research has focused on using fractal
processes to simulate natural shapes and textures (see Figure 1),
including mountains, clouds, water, plants, trees, and primitive
animals [3,4,5,6,7]. Additionally, we have recently conducted a
survey of natural imagery and found that a fractal model of
imaged 3-D surfaces furnishes an accurate description of both
textured and shaded image regions, thus providing validation of
this physics-derived model for both image texture and shading
[10].

2.  FRACTALS  AND  THE  FRACTAL  MODEL 

During the last twenty years, Benoit B. Mandelbrot has developed
and popularized a relatively novel class of mathematical
functions known as fractals [1,4]. Fractals are found widely
in nature [1,2,4]. Mandelbrot shows that a number of basic
physical processes, ranging from the aggregation of galaxies to
the curdling of cheese, produce fractal surfaces. One general
characterization is that any process that acts locally to produce
a permanent change in shape will, after innumerable repetitions,
result in a fractal surface. Examples are erosion, turbulent flow
(e.g., of rivers or lava) and aggregation (e.g., galaxy formation,
meteorite accretion, and snowflake growth). Fractals have also
been widely and successfully used to generate realistic scenes (see
Figure 1), including mountains, clouds, water, plants, trees, and
primitive animals [3,4,5,6,7].

Perhaps the most familiar examples of naturally occurring
fractal curves are coastlines. When we examine a coastline (as
in Figure 1), we see a familiar scalloped curve formed by innumerable
bays and peninsulas. If we then examine a finer-scale
map of the same region, we shall again see the same type of
curve. It turns out that this characteristic scalloping is present
at all scales of examination [2], i.e., the statistics of the curve
are invariant with respect to transformations of scale. This fact
causes problems when we attempt to measure the length of the
coastline, because it turns out that the length we are measuring
depends not only on the coastline but also on the length of
the measurement tool itself [2]! This is because, whatever the
size of the measuring tool selected, all of the curve length attributable
to features smaller than the size of the measuring tool will be
missed. Mandelbrot pointed out that, if we generalize the notion
of dimension to include fractional dimensions (from which we
get the word "fractal"), we can obtain a consistent measurement
of  the  coastline’s  length. 
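
The effect of ruler size on measured length, and the way a fractional
exponent removes it, can be sketched numerically. The example below is
not taken from the paper: it assumes the Koch curve as an idealized
"coastline," for which a ruler of length 3^-k fits exactly 4^k times,
and the function names are purely illustrative.

```python
import math

# Koch curve as an idealized coastline (an assumption for illustration):
# a ruler of length 3**-k fits exactly 4**k times along the curve.
D = math.log(4) / math.log(3)  # fractal dimension of the Koch curve

def ruler_and_count(k):
    """Return (ruler length, number of ruler placements) at level k."""
    return 3.0 ** -k, 4 ** k

def measure(k, power):
    """Measurement n * ruler**power; power=1 gives the ordinary length."""
    ruler, n = ruler_and_count(k)
    return n * ruler ** power

for k in range(1, 6):
    ruler, _ = ruler_and_count(k)
    print(f"ruler={ruler:.5f}  length={measure(k, 1.0):.3f}  "
          f"D-measure={measure(k, D):.3f}")
```

The ordinary length keeps growing as the ruler shrinks, while the
measure taken with exponent D stays fixed, which is the sense in which
a fractional dimension gives a consistent measurement.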


The definition. A fractal is defined as a set for which
the Hausdorff-Besicovitch dimension is strictly larger than the
topological dimension. Topological dimension corresponds to
the standard, intuitive definition of "dimension." Hausdorff-
Besicovitch dimension D, also referred to as the fractal dimension,
may be illustrated (and roughly defined) by the examples
(1) of measuring the length of an island's coastline, and (2)
measuring the area of the island.

To measure the length of the coastline we might select a
measuring stick of length X and determine that n such measuring
sticks could be placed end to end along the coastline. The length
of the coastline is then intuitively nX. If we were measuring the
area of the island, we could use a square of area X^2 to derive
an area of mX^2, where m is the number of squares it takes to
cover the island. If we actually did this we would find that both
of these measurements vary with X, the length of the measuring
instrument, an undesirable result.

In these two examples the length X is raised to a particular
power: the power of one to measure length, the power of two
to measure area. These are two examples of the general rule of
raising X to a power that is the dimension of the object being
measured. In the case of the island, raising X to the topological
dimension does not yield consistent results. If, however, we
were to use the power 1.2 instead of 1.0 to measure the length,
and 2.1 instead of 2.0 to measure the area, we would find that
the measured length and area remained constant regardless of
the size of the measuring instrument chosen.* The positive real
number D that yields such a consistent measurement is the
fractal dimension. D is always greater than or equal to the
topological dimension.

*This example is discussed at greater length in Mandelbrot's
book, "Fractals: Form, Chance and Dimension." The empirical
data are from Richardson 1961.

The most important lesson the work of Mandelbrot and
others teaches us is the following:

Standard notions of length and area do not produce
consistent measurements for many natural shapes: the
basic metric properties of these shapes vary as a function
of the fractal dimension. Fractal dimension, therefore,
is a necessary part of any consistent description of
such shapes.

This result, which could almost be stated as a theorem,
demonstrates the fundamental importance of knowing the fractal
dimension of a surface. It implies that any description of a
natural shape that does not include the fractal dimension cannot
be relied upon to be correct at more than one scale of examination.

Fractal Brownian functions. Virtually all the fractals
encountered in physical models have two additional properties:
(1) each segment is statistically similar to all others; (2) they are
statistically invariant over wide transformations of scale. Motion
of a particle undergoing Brownian motion is the canonical
example of this type of fractal. The discussion that follows will be
devoted exclusively to fractal Brownian functions, a generalization
of Brownian motion.

A random function B(x) is a fractal Brownian function if
for all x and Δx

$$\Pr\left( \frac{B(x + \Delta x) - B(x)}{\|\Delta x\|^{H}} < y \right) = F(y) \qquad (1)$$

where F(y) is a cumulative distribution function [1]. The fractal
dimension D of the graph described by B(x) is

$$D = 2 - H \qquad (2)$$

If H = 1/2 and F(y) is a zero-mean Gaussian with unit variance,
then B(x) is the classical Brownian function. This definition
has obvious extensions to two or more topological dimensions.
The fractal dimension of a fractal Brownian function can also
be measured from its Fourier power spectrum, as the spectral
density of a fractal Brownian function is proportional to $f^{-2H-1}$.
Discussion of the rather technical proof of this fact may be found
in [1].
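
The scaling property behind this definition can be checked with a small
simulation. The sketch below is an illustration rather than the paper's
procedure: it builds a classical Brownian function (H = 1/2) as a
running sum of unit-variance Gaussian steps and estimates H as the
slope of log mean absolute increment against log distance, then applies
D = 2 - H. The estimator, sample size, and set of lags are assumptions.

```python
import math
import random

def brownian_path(n, seed=0):
    """Classical Brownian function sampled at n + 1 points (H = 1/2)."""
    rng = random.Random(seed)
    path = [0.0]
    for _ in range(n):
        path.append(path[-1] + rng.gauss(0.0, 1.0))
    return path

def mean_abs_increment(path, lag):
    """Sample estimate of E|B(x + lag) - B(x)|."""
    diffs = [abs(path[i + lag] - path[i]) for i in range(len(path) - lag)]
    return sum(diffs) / len(diffs)

def estimate_H(path, lags=(1, 2, 4, 8, 16, 32)):
    """Least-squares slope of log E|dB| against log lag."""
    xs = [math.log(lag) for lag in lags]
    ys = [math.log(mean_abs_increment(path, lag)) for lag in lags]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

path = brownian_path(20000)
H = estimate_H(path)
print(f"H = {H:.2f}, D = 2 - H = {2 - H:.2f}")
```

For this simulated path the estimate should land near H = 0.5, i.e.
a graph dimension near 1.5.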

The fractal dimension of a surface corresponds roughly to
our intuitive notion of jaggedness. Thus, if we were to generate
a series of scenes with the same 3-D relief but increasing fractal
dimension D, we would obtain the following sequence: first, a
flat plane (D = 2), then rolling countryside (D ≈ 2.1), a worn,
old mountain range (D ≈ 2.3), a young, rugged mountain range
(D ≈ 2.5), and finally a stalagmite-covered plane (D ≈ 2.8).

The fractal dimension of a surface is invariant with respect
to transformations of scale, as Δx is independent of H and
F(y). The fractal dimension is also invariant with respect to
linear transformations of the data and thus it remains stable
over smooth monotonic transformations.

2.1  Fractals  And  The  Imaging  Process 

Before we can use a fractal model of natural surfaces to
help us understand images, however, we must determine how
the imaging process maps a fractal surface shape into an image
intensity surface. The mathematics of this problem is difficult
and no complete solution has as yet been achieved. Nonetheless,
simulation of the imaging process with a variety of fractal surface
models can provide us with an empirical answer, i.e., that
images of fractal surfaces are themselves fractal as long as the
fractal-generating function is spatially isotropic [10]. It is worth
noting that practical fractal-generation techniques, such as those
used in computer graphics, have had to constrain the fractal-
generating function to be isotropic so that realistic imagery could
be obtained [5].

Real images do not, of course, appear fractal over all possible
scales of examination. The overall size of the imaged surface
places an upper limit on the range of scales for which the surface
shape appears to be fractal, and a lower limit is set by the size
of the surface's constituent particles. In between these limits,
however, we may use Equation (1) to obtain a useful description
of the surface.

Simulation shows that the fractal dimension of the physical
surface dictates the fractal dimension of the image intensity
surface; it appears that the fractal dimension of the image is
a logarithmic function of the fractal dimension of the surface.
If we assume that the surface is homogeneous, therefore, we
can estimate the fractal dimension of the surface by measuring
the fractal dimension of the image data. Even if the surface is
not homogeneous, we can still infer the fractal dimension of the
surface from imaged surface contours and bounding contours,
by use of Mandelbrot's results.

What we have developed, then, is a method for inferring
a basic property of the 3-D surface (its fractal dimension) from
the image data. The fact that the fractal dimension corresponds
closely to our intuitive notion of roughness shows the importance
of the measurement: we can now discover from the image
data whether the 3-D surface is rough or smooth, isotropic or
anisotropic. We can know, in effect, what kind of cloth the
surface was cut from. The fact that the fractal dimension also
describes the basic metric properties of the imaged surface is
further indication that it is a critical element in any consistent
representation of natural surfaces.

2.2 Applicability Of The Fractal Model

An implication of the fractal surface model is that the image
intensity surface is itself fractal, and vice versa. This is because
image intensity is primarily a function of the angle between
the surface normal and the incident illumination; thus, if the
image intensities satisfy Equation (1), then (for a homogeneous
surface) the angle between surface normal and illuminant must
also, and, integrating, we find that the 3-D surface is a spatially
isotropic fractal.

A method of evaluating the usefulness of the fractal surface
model, therefore, is to determine whether or not images of
natural surfaces are well described by a fractal function. To
evaluate the applicability of the fractal model, we first rewrite
Equation (1) to obtain the following description of the manner
in which the second-order statistics of the image change with
scale:

$$E(|\Delta I_{\Delta x}|)\,\|\Delta x\|^{-H} = E(|\Delta I_{\Delta x = 1}|) = k \qquad (3)$$

where k is a constant and E(|ΔI_Δx|) is the expected value of
the change in intensity over distance Δx. Equation (3) is a
hypothesized relation among the image intensities, a hypothesis
that we may test statistically. If we find that Equation (3) is
true of the image intensity surface and the viewed surface is
homogeneous and continuous, then we may conclude that the 3-D
surface is itself fractal. It is an important characteristic of
the fractal model that we can determine its appropriateness for
particular image data, because it means that we can know when,
and when not, to use the model.
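
One way such a test might be carried out is sketched below. It is
not the paper's statistical procedure: it assumes a single scan line of
intensities, fits a straight line to log mean absolute difference
against log distance, and accepts the model only when the slope H lies
in [0, 1] and the worst residual is small. The tolerance, the lags, and
the two synthetic signals are illustrative assumptions.

```python
import math
import random

def fit_scaling_law(row, lags=(1, 2, 4, 8), eps=1e-12):
    """Fit log E|dI| = H log(lag) + c; return (H, worst log-residual)."""
    xs, ys = [], []
    for lag in lags:
        diffs = [abs(row[i + lag] - row[i]) for i in range(len(row) - lag)]
        xs.append(math.log(lag))
        ys.append(math.log(sum(diffs) / len(diffs) + eps))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    H = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    c = my - H * mx
    worst = max(abs(y - (H * x + c)) for x, y in zip(xs, ys))
    return H, worst

def looks_fractal(row, tol=0.2):
    """Accept the model when H is in [0, 1] and the fit error is small."""
    H, worst = fit_scaling_law(row)
    return 0.0 <= H <= 1.0 and worst < tol

# A random walk (should fit) and a strictly periodic stripe pattern
# (should not, since its differences vanish at the stripe period).
rng = random.Random(1)
walk = [0.0]
for _ in range(4095):
    walk.append(walk[-1] + rng.gauss(0.0, 1.0))
stripes = [100.0 if (i // 4) % 2 == 0 else 0.0 for i in range(4096)]

print(looks_fractal(walk), looks_fractal(stripes))
```

The point of the sketch is only that the model carries its own
applicability check: data that do not follow the power law are rejected
rather than silently assigned a dimension.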

To evaluate the suitability of a fractal model for natural
textures, the homogeneous regions from each of six images of
natural scenes were densely sampled. In addition, twelve textures
taken from Brodatz [8] were digitized and examined (see
Figure 3). The intensity values within each of these regions were
then approximated by a fractal Brownian function and the
approximation error observed.

For the majority of the textures examined (77%), the model
described the image data accurately (see [19] for more detail).
In 15% of the cases the region was constant except for random,
zero-mean perturbations; consequently, the fractal function
correctly approximates the image data, although the fractal
dimension was equal to the topological dimension and thus the
data's dimension is technically not "fractional." The fit was poor in
only 8% of the regions examined and, in many of these cases, it
appeared that the image digitization had become saturated.

The fact that the vast majority of the regions examined were
quite well approximated by a fractal Brownian function indicates
that the fractal surface model will provide a useful description of
natural surfaces and their images. Fractal Brownian functions
do not, of course, account for such large-scale spatial structures
as those seen in the image of a brick wall or a tiled floor. Such
structures must be accounted for by other means.

3.  INFERRING  SURFACE  PROPERTIES 

Fractal functions appear to provide a good description of
natural surface textures and their images; thus, it is natural
to use the fractal model for texture segmentation, classification,
and shape-from-texture. The first four headings of this section
describe the research that has been performed in this area, and
indicate likely directions for further research.

Fractal functions with H ≈ 0 can be used to model smooth
surfaces and their reflectance properties. For the first time,
therefore, we can offer a single model encompassing both image
shading and texture, with shading as a limiting case in the
spectrum of texture granularity. The fractal model thus allows
us to make a reasonable and rigorous definition of the categories
"texture" and "shading," thus enabling us to discover similarities
and differences between them. The final heading of this section
briefly discusses this result.

3.1  An  Example  Of  Texture  Segmentation 

Figure 2(a) shows an aerial view of San Francisco Bay. This
image was digitized and the fractal dimension computed for
each 8 X 8 block of pixels. Figure 2(b) shows a histogram of
the fractal dimensions computed over the whole image. This
histogram of fractal dimension was then broken at the "valleys"
between the modes of the histogram, and the image segmented
into pixel neighborhoods belonging to one mode or another.*
Figure 2(c) shows the segmentation obtained by thresholding
at the breakpoint indicated by the arrow under (b); each pixel
in (c) corresponds to an 8 X 8 block of pixels in the original
image. As can be seen, a good segmentation into water and land
was achieved, one that cannot be obtained by thresholding on
image intensity.
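
The block-wise procedure just described can be sketched as follows.
This is a hedged illustration, not the authors' implementation: the
per-block estimator (log-log slope of mean absolute pixel differences),
the relation D = 3 - H used for an intensity surface, and the simple
valley finder are all assumptions, and the input is a synthetic image
rather than the aerial photograph.

```python
import math
import random

def block_dimension(block, lags=(1, 2, 4), eps=1e-9):
    """Estimate D for one square block, assuming D = 3 - H for a surface."""
    n = len(block)
    xs, ys = [], []
    for lag in lags:
        total, count = 0.0, 0
        for r in range(n):
            for c in range(n - lag):
                total += abs(block[r][c + lag] - block[r][c])  # along rows
                total += abs(block[c + lag][r] - block[c][r])  # along columns
                count += 2
        xs.append(math.log(lag))
        ys.append(math.log(total / count + eps))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    H = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return min(3.0, max(2.0, 3.0 - H))  # clamp to the range for a surface

def block_map(image, size=8):
    """Fractal dimension of every size x size block, as a 2-D list."""
    rows, cols = len(image) // size, len(image[0]) // size
    return [[block_dimension([row[c * size:(c + 1) * size]
                              for row in image[r * size:(r + 1) * size]])
             for c in range(cols)] for r in range(rows)]

def valley_threshold(values, bins=10, lo=2.0, hi=3.0):
    """Break the histogram at the emptiest bin between its two tallest bins."""
    hist = [0] * bins
    for v in values:
        hist[min(bins - 1, int((v - lo) / (hi - lo) * bins))] += 1
    a, b = sorted(sorted(range(bins), key=lambda i: hist[i])[-2:])
    valley = min(range(a, b + 1), key=lambda i: hist[i])
    return lo + (valley + 0.5) * (hi - lo) / bins

# Synthetic 32 x 64 image: smooth ramp on the left, noise on the right.
rng = random.Random(0)
image = [[float(c) if c < 32 else rng.uniform(0, 255) for c in range(64)]
         for r in range(32)]
dims = block_map(image)
t = valley_threshold([d for row in dims for d in row])
labels = [[int(d > t) for d in row] for row in dims]
for row in labels:
    print(row)
```

Each entry of `labels` stands for one 8 X 8 block, mirroring the way
each pixel of the segmentation corresponds to a block of the original
image.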

This image was then averaged down, from 512 X 512 pixels
into 256 X 256 and 128 X 128 pixel images, and the fractal
dimension recomputed for each of the reduced images. Figures
2(d) and (e) illustrate the segmentations produced by using
the same breakpoint as had been employed in the original full-
resolution segmentation. These results demonstrate the stability
of the fractal dimension measure across wide (4:1) variations
in scale.
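
The reduction step can be illustrated with a few lines of code. The
paper does not state which averaging kernel was used, so the 2 X 2
block mean below is an assumption, and `average_down` is a hypothetical
helper name.

```python
def average_down(image):
    """Halve the resolution by replacing each 2 x 2 block with its mean."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * r][2 * c] + image[2 * r][2 * c + 1]
              + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(cols)] for r in range(rows)]

full = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
half = average_down(full)       # 4 x 4 -> 2 x 2
quarter = average_down(half)    # 2 x 2 -> 1 x 1
print(half, quarter)
```

Applying the function twice gives the 4:1 reduction (512 to 128); the
stability test then consists of recomputing the block dimensions on
each reduced image and thresholding at the unchanged breakpoint.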

Several other images have been segmented in this manner
[19]. In each case a good segmentation was achieved. The
computed fractal dimension, and thus the segmentation, was
found to be stable over at least 4:1 variations in scale; most were
stable over a range of 8:1. Stability of the fractal description
is to be expected, because the fractal dimension of the image is
directly related to the fractal dimension of the viewed surface,
which is a property of natural surfaces that has been shown to
be invariant with respect to transformations of scale [2].

*No attempt was made to incorporate orientational information
into measurement of the local fractal dimension, i.e., differences
in dimension among various image directions at a point were
collapsed into one average measurement.

Figure 2. San Francisco Bay, and its texture segmentations.

The fact that the fractal description of texture is stable
with respect to scale is a critically important property. After all,
consider: how can we hope to compute a stable, viewer-
independent representation of the world if our information
about the world is not stable with respect to scale?
This example of texture property measurement reiterates what
we observed earlier, i.e., the fact that the fractal dimension of
the surface is necessary to any consistent description of a natural
surface.

3.2  A  Comparison  With  Other  Segmentation  Techniques 

To obtain an objective comparison with previously established
texture segmentation techniques, a mosaic of eight natural
textures taken from Brodatz [8] was redigitized. The digitized
texture mosaic, shown in Figure 3, was constructed by Laws
[9,10] for the purpose of comparing various texture segmentation
procedures. The textures that comprise this data set were chosen
to be as visually similar as possible; gross statistical differences
were removed by mean-value and histogram equalization.

Segmentation performance figures for these data exist for several
techniques and, although differences in digitization complicate
any comparisons we might wish to make, Laws's performance
figures nevertheless serve as a useful yardstick for assessing
performance on these data.

For this comparison simple orientational information was
incorporated into the fractal description; the fractal dimension
was calculated separately for the x and y coordinates. The two-
parameter fractal segmenter yielded a theoretical classification
accuracy of 84.4%. This compares quite favorably with correlation
techniques [11,12] reported by Laws as attaining 65% accuracy,
as well as with co-occurrence techniques [13,14] reported
to be 72% accurate. This superior performance was achieved
despite the large number of texture features employed by the
other methods.

Figure 3. The Brodatz textures used for comparison.

The simple two-parameter fractal segmenter even compares
well with Laws's own texture energy statistics; even though his
segmentation procedure included more than a dozen texture
statistics that were optimized for the test data, its theoretical
segmentation accuracy was only 3% better. Thus, the results of
this comparison indicate that fractal-based texture segmentation
will likely prove to be a general and powerful technique (for more
details, see [19]).

3.3 Relationship To Texture Models

The fact that the fractal dimension of the image data can be
measured either by using co-occurrence statistics in conjunction
with Equation (1), or by means of the Fourier power spectrum,
suggests one interesting aspect of the fractal model: it highlights
a formal link between co-occurrence texture measures [13,14]
and Fourier techniques [15,16,17]. The mathematical results
Mandelbrot derives for fractal Brownian functions show that the
way interpixel differences change with distance determines the
rate at which the Fourier power spectrum falls off as frequency
is increased, and vice versa.
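
The link can be written out as a short worked relation. It assumes the
standard result for a one-dimensional fractal Brownian function that
the power spectrum falls off as f^-(2H+1); the symbols β (measured
spectral exponent) and P(f) (power spectrum) are introduced here only
for illustration.

```latex
\begin{align*}
E\bigl(|\Delta I_{\Delta x}|\bigr) &\propto \|\Delta x\|^{H}
  && \text{(interpixel differences)} \\
P(f) &\propto f^{-\beta}, \quad \beta = 2H + 1
  && \text{(Fourier power spectrum)} \\
H &= \frac{\beta - 1}{2}, \qquad D = 2 - H = \frac{5 - \beta}{2}
\end{align*}
```

For example, classical Brownian motion has H = 1/2, which gives a
spectral exponent β = 2 and a graph dimension D = 1.5; measuring
either H from differences or β from the spectrum fixes the other.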

Thus, it appears that the fractal model offers potential for
unifying and simplifying the co-occurrence and Fourier texture
descriptions. If we believe that natural surface textures and
their images are fractal (as seems to be indicated by the previous
results), then the fractal dimension is the most relevant
parameter in differentiating among textures. In this case we
would expect both the Fourier and co-occurrence techniques to
provide reasonable texture segmentations, as both yield sufficient
information to determine the fractal dimension. The advantage
of the fractal model would be that it captures a simple physical
relationship underlying the texture structure, a relationship
lost with either of the other two characterizations of texture.
Knowledge of the fundamental physical principle can result in
both increased computational efficiency and further insight.

3.4  Shape  From  Texture 

There are two ways surface shape is reflected in image texture:
(1) projection foreshortening, a function of the angle between
the viewer and the surface normal, and (2) the perspective
texture gradient that is due to increasing distance between
the viewer and the surface. These two phenomena are independent
in that they have separate causes. Thus, they can serve to
confirm each other, i.e., if projection foreshortening is used to
estimate surface tilt, that estimate is independently confirmed
if there is a texture gradient of the proper magnitude and same
direction [17,18]. We may be confident our estimate is correct
when such independent confirmation is found.

The fractal dimension found in the image appears to be
nearly independent of the orientation of the surface (by virtue
of independence with respect to scale); therefore fractal dimension
cannot be used to measure surface orientation. Projection
foreshortening does, however, affect the variance of the distribution
F(y) associated with the fractal dimension (see Equation
(1)). Foreshortening affects Var(F(y)) in exactly the manner it
affects the distribution of tangent direction.

Thus, to estimate surface orientation, we might assume that
the surface texture is isotropic and estimate surface orientation
on the basis of previously derived results [18]. While this
often works [19], the necessity of assuming isotropy is a serious
shortcoming of this technique. An important new result, therefore,
is that we may in part cure this problem by observing the
fractal dimensions in the x and y directions. If they are unequal
we have prima facie evidence of anisotropy in the surface texture,
because fractal dimension is unaffected by projection.
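
A directional check of this kind might look like the sketch below. It
is an assumption-laden illustration, not the paper's method: the
scaling exponent H is estimated separately along image rows (x) and
columns (y) from mean absolute differences, and a texture is flagged
when the two estimates differ by more than an arbitrary tolerance. The
test texture is synthetic.

```python
import math
import random

def directional_H(image, axis, lags=(1, 2, 4, 8)):
    """Log-log slope of mean |difference| vs distance along one image axis.

    axis=0 measures along x (within rows); axis=1 along y (within columns).
    """
    rows, cols = len(image), len(image[0])
    xs, ys = [], []
    for lag in lags:
        if axis == 0:
            diffs = [abs(image[r][c + lag] - image[r][c])
                     for r in range(rows) for c in range(cols - lag)]
        else:
            diffs = [abs(image[r + lag][c] - image[r][c])
                     for r in range(rows - lag) for c in range(cols)]
        xs.append(math.log(lag))
        ys.append(math.log(sum(diffs) / len(diffs)))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def is_anisotropic(image, tol=0.2):
    """Flag a texture whose x and y scaling exponents differ by more than tol."""
    return abs(directional_H(image, 0) - directional_H(image, 1)) > tol

# Synthetic anisotropic texture: each row is an independent random walk,
# so intensities are correlated along x but not along y.
rng = random.Random(2)
texture = []
for _ in range(64):
    row = [0.0]
    for _ in range(63):
        row.append(row[-1] + rng.gauss(0.0, 1.0))
    texture.append(row)

print(directional_H(texture, 0), directional_H(texture, 1),
      is_anisotropic(texture))
```

Because the dimension is taken to be unaffected by projection, a
mismatch between the two directional estimates points to the texture
itself rather than to surface tilt.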

However a foreshortening-derived estimate of surface orientation
is produced, we may still seek confirmation of it by
measuring the perspective texture gradient; if confirmation is
found, we may be confident of our estimate. Such a gradient
appears in Figure 2: the houses dwindle in size with increasing
distance from the viewer. Initial results, detailed in [19], indicate
that perspective texture gradients can be inferred from the
locally computed fractal dimension.

These two new results, i.e., the ability to obtain evidence of
surface texture anisotropy and the measurement of the perspective
texture gradient, are extremely important because they
offer a way to make shape-from-unfamiliar-texture techniques
sufficiently reliable to be useful. Development of these
techniques, therefore, constitutes an important task for future
research.

3.5  Shading  And  Texture 

Fractal functions with H ≈ 0 can be used to model smooth
surfaces and their reflectance properties accurately. When H ≈ 0,
the surface is locally planar, except for small, random variations
described by the function F(y) in Equation (1). If we assume
that incident light is reflected at the angle of incidence and
we make the variance of F(y) small relative to the pixel size, the
surface will be mirrorlike. If, on the other hand, the variance of
F(y) is large relative to the pixel size, the surface will become
more Lambertian.

The fractal model, therefore, is a single model that can account
for both image shading and texture, with shading corresponding
to the limiting value of H. The fractal model thus
allows us to make a reasonable and rigorous definition of the
categories "texture" and "shading," in terms that can be measured
by using the image data. One important goal of future research
will be to discover similarities or differences between these two
categories; initial results indicate that local shape-from-shading
results [20] can be generalized to include shape-from-texture.

4.  COMPUTING  A  DESCRIPTION 

Current methods for representing the three-dimensional
world suffer from a certain awkwardness and inflexibility that
makes them difficult to envisage as the basis for human-
performance-level capabilities. They have encountered problems
in dealing with partial knowledge or uncertain information,
and they become implausibly complex when confronted with the
problem of representing a crumpled newspaper, a clump of leaves,
or a puffy cloud. Furthermore, they seem ill-suited to solving the
problem of representing a class of objects, or determining that
a particular object is a member of that class.

What is wrong with conventional shape representations?
One major problem is that they make too much information
explicit. Experiments in human perception [21] lead one to
believe that our representation of a crumpled newspaper (for
instance) is not accurate enough to recover every z value; rather,
it seems that we remember the general "crumpledness" and a few
of the major features, such as the general outline. The rest of
the newspaper's detailed structure is ignored; it is unimportant,
random.

From the point of view of constructing a representation, the
only important constraints on shape are the crumpledness and
general outline. What we would like to do is somehow capture
the notion of constrained chance, that is, the intuition that "a
crumpled newspaper has x, y and z structural regularities and
the rest is just variable detail," thus allowing us to avoid dealing
with inconsequential (random) variations and to reason instead
only about the structural regularities.

4.1  The  Process  Of  Computing  A  Description 

How shall we go about computing such a "constrained
chance" description? Let us consider the problem formally and
see where that leads us. The process of computing a shape
description (given some sensory data) seems best characterized
as attempting to confirm or deny such hypotheses as "shape x
is consistent with these sense data." Computation of a shape
description, therefore, seems to be a problem in induction [20].

If, naively, we try to use an inductive method, we start
with the set of all possible shape hypotheses; we then attempt
to winnow the set down to a small number of hypotheses
that are confirmed by the sensory data. The "set of all
shape hypotheses," however, is much too large to work with.
Consequently, we must take a slightly different tack.

Using the notion of constrained chance. Rather than
attempting to enumerate "all shape hypotheses" explicitly, let us
instead construct a shape generator that uses a random number
generator to produce a surface shape description (I shall shortly
describe how to do this). If we were to run this shape generator
for an infinite period, it would eventually produce instances of
every shape within a large class of shapes. If the generator were
so constructed that the class of shapes produced was exactly the
set of "all hypotheses" about shape, then the program for the
shape generator, together with the program for the random
number generator, would comprise a description of the set of all
shape hypotheses.

The term "representation" will be used to refer to the scheme
for representing shapes, while the term "description" will be
reserved for specific instances. Thus one can compute a description
of some object; it will be a member of the class of shapes
that can be accounted for within the representation.

The shape generator illustrates how the notion of constrained
chance may be used to obtain a compact description
of an infinite set of shapes. By changing the constraints that
determine how the output of the random number generator
is translated into shape, we can change the set of shapes
described; specifically, we can introduce constraints that rule
out some classes of shape and thus restrict the set of shapes that
are described. The ability to progressively restrict the set of
shapes described allows us to use the constrained-chance shape
generator as the basis for induction, rather than being forced to
use the explicitly enumerated set of all shape hypotheses.

The process of computing a "constrained chance" description
is straightforward. We use image data to infer (using
knowledge of the physics of image formation) constraints on
the shape, and then introduce those constraints into the shape
generator. The end result will be a programlike description that
is capable of producing all the shapes that are consistent with
the image data; i.e., we shall have a description of the shapes
confirmed by the image data. This, then, is the type of description
we wanted: a description of shape that contains the important
structural regularities that can be inferred from the image
(e.g., crumpledness, outline), but one that leaves everything else
as variable, random.

Some people are already doing this. Something very
much like this constrained-chance representation is already being
widely utilized in the computer graphics community. Natural-looking
shapes are produced by a simple fractal program that
recursively subdivides the region to be filled, introducing random
jaggedness of appropriate magnitude at each step [3,5].
The jaggedness is determined by specifying the fractal dimension.
The shapes that can be produced in this manner range
from planar surfaces to mountainlike shapes, depending on the
fractal dimension. Current graphics technology often employs
fractal shape generators in a more constrained mode; often the
overall, general shape or the boundary conditions are specified
beforehand. Thus, a scene is often constructed by first specifying
initial constraints on the general shape, and then using a
fractal shape generator to fill in the surface with appropriately
jagged (or smooth) details. The description employed in such
graphics systems, therefore, is exactly a constrained-chance
description: important details are specified, and everything else
is left unspecified except in a qualitative manner.
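The recursive subdivision described above can be sketched in a few lines. What follows is a minimal one-dimensional illustration, not the code used in the graphics systems cited: the function name, the Gaussian offsets, and the parameter `hurst` (for a profile, fractal dimension is roughly 2 - hurst) are assumptions made for this example. The two endpoints play the role of boundary conditions specified beforehand; everything between them is left to chance.

```python
import random


def midpoint_displacement(left, right, levels, hurst, scale=1.0, seed=None):
    """Return a 1-D profile of 2**levels + 1 heights between two fixed endpoints.

    hurst (0 < hurst < 1) sets the ruggedness: a small value gives a
    jagged profile, a value near 1 gives a smooth one.
    """
    rng = random.Random(seed)
    heights = [left, right]
    sigma = scale
    for _ in range(levels):
        sigma *= 0.5 ** hurst  # smaller random offsets at each finer scale
        refined = []
        for a, b in zip(heights, heights[1:]):
            mid = 0.5 * (a + b) + rng.gauss(0.0, sigma)
            refined.extend([a, mid])
        refined.append(heights[-1])
        heights = refined
    return heights


# Hypothetical usage: a rugged and a smooth profile with the same endpoints.
rugged = midpoint_displacement(0.0, 10.0, levels=6, hurst=0.3, seed=1)
smooth = midpoint_displacement(0.0, 10.0, levels=6, hurst=0.9, seed=1)
```

Both profiles satisfy the same boundary conditions; only the qualitative ruggedness differs, which is the sense in which the description is constrained but otherwise left to chance.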

This type of description bears a close relationship to surface
interpolation methods (e.g., [24]). Typically, such schemes fit a
smooth surface that satisfies whatever boundary conditions are
available. The initial boundary conditions, together with the
interpolation function, constitute a precise description of
the surface shape. Such schemes are limited to smooth surfaces,
however, and therefore are incapable of dealing with most
natural shapes. In contrast, a fractal-based representation allows
either rough or smooth surfaces to be fit to the initial boundary
conditions, depending upon the fractal dimension. This method
of description therefore is quite capable of describing most
natural surfaces, and that is why the graphics community is
turning to the use of fractal-based descriptions for natural
surfaces.

In order to make use of this type of description it is necessary
to be able to specify the surface shape in a qualitative
manner, i.e., how rugged is the topography? This specification
of qualitative shape can be accomplished by fixing the fractal
dimension. The fact that we have recently developed a method
of inferring the fractal dimension of the 3-D surface directly from
the image data means that we are now able, for the first time,
to actually compute a fractal or constrained-chance description
of a real scene from its image.
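The inference method itself is not reproduced in this section. As a rough illustration only, the sketch below uses a common textbook estimator rather than the author's procedure: for a fractional Brownian profile the mean squared increment grows as lag^(2H), so H can be read off a log-log fit, and the fractal dimension of the profile is then 2 - H. The function names and the choice of lags are assumptions.

```python
import math
import random


def estimate_hurst(profile, lags=(1, 2, 4, 8, 16)):
    """Half the least-squares slope of log(mean squared increment) vs. log(lag)."""
    xs, ys = [], []
    for lag in lags:
        diffs = [profile[i + lag] - profile[i] for i in range(len(profile) - lag)]
        xs.append(math.log(lag))
        ys.append(math.log(sum(d * d for d in diffs) / len(diffs)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return 0.5 * covariance / variance


def random_walk(n, seed=0):
    """Ordinary Brownian profile (expected H = 0.5), used here as test data."""
    rng = random.Random(seed)
    height, out = 0.0, []
    for _ in range(n):
        height += rng.gauss(0.0, 1.0)
        out.append(height)
    return out


h = estimate_hurst(random_walk(20000, seed=1))
profile_dimension = 2.0 - h  # close to 1.5 for an ordinary random walk
```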

Not only terrestrial topography has been modeled by use
of a constrained-chance representation but also clouds, ponds,
riverbeds, snowflakes, ocean surf and stars, just to name a few
examples  [l  ,.'i.  1.5, 6, 7]  Researchers  have  also  used  constrained- 
chance  generators  to  produce  plant  shapes  [1,1,0].  A  very 
natural-looking tree can be produced by recursively applying
a random number generator and simple constraints on branching
geometry. In each case a random number generator plus a
surprisingly small number of constraints can be used to produce
very good models of apparently complex natural phenomena.
Thus, there is hope for extending this approach well beyond the
domain of land topography.

4.2  An  Example  Of  Computing  A  Description 

Figure 4 illustrates an actual example of computing such
a description. Figure 4(a) is an image of a real mountain. Let
us suppose that we wished to use the image data to construct
a three-dimensional model of the rightmost peak (arrow), perhaps
for the purpose of predicting whether or not we could climb
it. I will take the standard fractal technology used in the computer
graphics community as the unconstrained "primal" shape
generator, as it provides an apparently accurate model of a wide
range of natural surfaces.

All that is necessary to construct a description of this mountain
peak is to extract shape constraints from the image and
insert them into the primal shape generator. The fractal dimension
of the 3-D surface is the principal parameter (constraint)
required by our fractal shape generator; roughly speaking, it
determines the ruggedness of the surface. The fractal dimension
of the 3-D surface in the region near the rightmost peak
was inferred from the fractal dimension of the image intensity
surface in that area [19]. Constraint on the general outline
of this peak was derived from distinguished points (those with
high curvature) along the boundary between sky and mountain.
These two constraints, together with the shape generator, are
a 3-D representation of this peak; the question is: how good
a representation? A view of a 3-D model derived from this
representation is shown in Figure 4(b). It appears that these
simple constraints are sufficient for computing a good* 3-D
representation of the peak.

Figure 4. An example of computing a constrained-chance description.

4.3 What Do We Accomplish With This Approach?

Let's  consider  the  problems  cited  above: 

(1) The problem of representing a complex shape, such as
a crumpled newspaper. The problem with a shape-primitive
representation such as surface normals, voxels or generalized
cylinders is that the resulting description seems hopelessly complex.
Because the constrained-chance representation allows us
to deal only with the structural regularities and to ignore
inconsequential details, the problem can become much simpler.
Thus, for instance, the graphics community has found that
constrained-chance fractal descriptions of complex objects (e.g.,
a mountain) are quite compact and easy to manipulate. It also
turns out that many previously simple things, such as describing
a smooth plane, remain simple.

How does this representation function when we want to compute
a description of a specific mountain, bush or other entity
from its image? Current "shape-from-x" research furnishes
constraints on shape in a variety of forms: surface orientation (from
texture [15-18,25], shading [22,23,26]), relative depth (from
motion [27,28], contour [29-31]), and absolute depth (from
stereo [32-34], egomotion [35,36]). It appears to be fairly
straightforward to mix each of the various flavors of constraint
into the vanilla-flavor shape generator [3,5], although significant
research remains to be done. As more shape constraints are
obtained from the image, the description becomes more and more
precise; i.e., there is less and less chance in the description.
Eventually only one shape satisfies all of the constraints.

* Rather primitive ray tracing, etc., was used to generate this
image; better code is being implemented.
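The idea that each added constraint removes some of the chance can be made concrete with a small sketch. This is an illustration under assumptions, not the paper's implementation: the class name `ConstrainedChance`, the `pinned` dictionary of known heights (standing in for constraints recovered from stereo, contour, and so on), and the helper `spread` are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ConstrainedChance:
    levels: int                                  # profile has 2**levels + 1 samples
    hurst: float                                 # qualitative ruggedness constraint
    pinned: dict = field(default_factory=dict)   # index -> height known from the image

    def sample(self, rng):
        """Generate one profile consistent with every constraint."""
        n = 2 ** self.levels
        z = [None] * (n + 1)
        z[0] = self.pinned.get(0, 0.0)
        z[n] = self.pinned.get(n, 0.0)
        step, sigma = n, 1.0
        while step > 1:
            half = step // 2
            sigma *= 0.5 ** self.hurst
            for i in range(half, n, step):
                guess = 0.5 * (z[i - half] + z[i + half]) + rng.gauss(0.0, sigma)
                z[i] = self.pinned.get(i, guess)  # a known height overrides chance
            step = half
        return z


def spread(description, index, trials=200, seed=0):
    """Range of heights the description can produce at one sample position."""
    rng = random.Random(seed)
    values = [description.sample(rng)[index] for _ in range(trials)]
    return max(values) - min(values)


loose = ConstrainedChance(levels=5, hurst=0.7)
tight = ConstrainedChance(levels=5, hurst=0.7, pinned={16: 3.0})
```

With the extra constraint, `spread(tight, 16)` is zero while `spread(loose, 16)` is not: the description has become more precise at that point and remains qualitative elsewhere.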

How complex could such a description become? The
constrained-chance representation would at worst be as complex
as a two-dimensional array of z values representing the same
surface, because we could always use it to actually generate such
an array of z values. As mentioned previously, experiments in
human perception indicate that our representations are usually
not accurate enough to recover every z value. The representation
of a particular object, therefore, is likely to be quite a bit simpler
than a full depth map.

(2) The problem of representing classes of shapes, such
as are referred to by the terms "a mountain" or "a bush."
Again, the ability to specify important structural details and
leave the rest only qualitatively constrained allows simplification
of the problem. The definition of "a mountain," for instance,
might reasonably consist entirely of a specification of the fractal
dimension of the surface and a caveat concerning size. If we
are to judge by the results reported in the computer graphics
literature, the notion of representation by constrained chance
thus allows us, using only a few lines of code, to produce an
accurate description of the class of shapes we label "mountain"
or "bush."

(3) The problem of determining the set of appropriate
descriptions when the shape is underconstrained by the sense
data. The problem with standard shape-primitive representations
is that either we must generate all combinations of
shape primitives consistent with the sense data (a very hard
problem), or pick a prototype and specify error bounds. The
problem with using prototypes plus error bounds is that we are
forced to overcommit ourselves by choosing the prototype; e.g.,
there is something seriously wrong about describing a cube as
"a sphere ±0.1r", even though the cube certainly fits within the
specified volume.

Because  the  constrained-chance  representation  allows 
details  to  be  left  constrained  but  unspecified,  it  allows  us  to  deal 
with  insufficient  sense  data  by  simply  adding  in  those  constraints 
that  can  be  deduced  from  the  image  data  and  committing  our¬ 
selves  no  further.  The  result  is  a  programlike  description  that 
can  be  analyzed  and  manipulated,  does  not  overcommit  itself  as 
to  object  shape,  and  allows  examples  of  shapes  consistent  with 
the  image  data  to  be  generated  and  examined. 

(4) The problem of determining that a specific description
is a member of a more general class. Here the problem
with shape-primitive representations is that there is so much
variability among the descriptions of the members of a class such
as "mountain" that a description of the class as a whole seems
extremely difficult, and determination of class membership even
more so.

The problem of establishing class membership by using
constrained-chance representations reduces to determining
whether the constraints used to specify a particular description
are a subset of those of the more general class. A determination
regarding class membership is, therefore, exactly equivalent to
determining whether one program's output is a subset of another
program's output. While such automatic proof is a difficult
problem, it is at least tractable and well-defined, unlike the
equivalent problem when using a shape-primitive representation.
Thus, a constrained-chance representation allows
a clear and potentially useful definition of what it means to
"recognize that x is a y."

Further, because we need only deal with the structural
regularities, this problem can become much simpler than it might
at first appear. Taking the class "a mountain" to be defined by
fractal dimension and overall size (a definition that is actually
sufficient to produce realistic mountain shapes) we can, for
instance, easily determine that the description computed by us for
the mountain peak is in fact a description of part of a mountain,
a task that previously seemed to be nearly impossible.
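In the simplest case, where every constraint is an allowed interval on a named parameter, the membership test described above reduces to interval containment: every shape the specific description can generate must also be one the class can generate. The sketch below assumes that simplified form; the parameter names and all numeric ranges are hypothetical and are not taken from the paper.

```python
def is_member(description, class_definition):
    """True if the description is at least as tight as the class on every class constraint."""
    for name, (class_lo, class_hi) in class_definition.items():
        if name not in description:
            return False  # the description leaves this property unconstrained
        lo, hi = description[name]
        if lo < class_lo or hi > class_hi:
            return False  # the description admits shapes the class excludes
    return True


# Hypothetical class: defined only by fractal dimension and overall size.
mountain = {"fractal_dimension": (2.1, 2.5), "height_m": (300.0, 9000.0)}

# Hypothetical description computed from an image: tighter, with one extra constraint.
peak = {
    "fractal_dimension": (2.2, 2.3),
    "height_m": (1500.0, 2500.0),
    "slope_deg": (35.0, 50.0),
}
```

Here `is_member(peak, mountain)` holds, while the reverse does not: the class is the less constrained of the two programs, so its output is the larger set.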


5.  SUMMARY 

Fractal functions seem to provide a good model of natural
surface shapes. Many basic physical processes produce fractal
surfaces. Fractal surfaces also look like natural surfaces, and
so have come into widespread use in the computer graphics
community. Furthermore, we have conducted a survey of natural
imagery and found that a fractal model of imaged 3-D surfaces
furnishes an accurate description of both textured and shaded
image regions.

Fractal  functions,  therefore,  are  useful  for  addressing  the 
related  problems  of  representing  complex  natural  shapes  such  as 
mountains,  and  computing  a  description  of  such  shapes  from 
image  data.  The  following  describes  the  progress  achieved 
toward  the  solution  of  these  problems. 

Computing  a  description.  Characterization  of  image 
texture  by  means  of  a  fractal  surface  model  has  shed  considerable 
light  on  the  physical  basis  for  several  of  the  texture  techniques 
currently  in  use,  and  made  it  possible  to  describe  image  texture 
in  a  manner  that  is  stable  over  transformations  of  scale  and 
linear  transforms  of  intensity.  These  properties  of  the  fractal 
surface  model  allow  it  to  serve  as  the  basis  for  an  accurate  image 
segmentation  procedure  that  is  stable  over  a  wide  range  of  scales. 

Because fractal dimension is not affected by projection
distortion, its measurement can significantly enhance our ability
to estimate shape from (unfamiliar) texture. Specifically, it
seems that measurement of fractal dimension can provide (1)
evidence of surface texture anisotropy, and (2) an estimate of
the perspective texture gradient. Both capabilities are extremely
important because they provide a way to obtain independent
confirmation of the assumptions on which previously reported
[18] shape-from-unfamiliar-texture techniques are based.

Representing  natural  shapes.  A  constrained-chance 
representation  modeled  after  the  fractal  techniques  used  by 
the  graphics  community  seems  useful  for  representing  complex 
natural  shapes,  such  as  a  crumpled  newspaper  or  a  moun¬ 
tain.  The  problem  encountered  when  using  conventional  shape- 
primitive  representations  to  describe  natural  surfaces  is  that  the 
resulting  description  is  often  hopelessly  complex.  Because  the 
constrained-chance  representation  allows  us  to  deal  only  with 
the  structural  regularities  and  to  ignore  inconsequential  details, 
the problem can become much simpler. Thus, for instance, the
graphics community has found that constrained-chance fractal
descriptions of complex objects (e.g., a mountain) are quite compact
and easy to manipulate. Similarly, the problem of representing
classes of shapes, such as are referred to by the terms
"a mountain" or "a bush," can also be significantly simplified.

The encouraging progress that has already been achieved on
both of these problems augurs well for this approach. It appears
that a constrained-chance representation incorporating a fractal
model of surface shape will provide an elegant solution for some
of the most difficult problems encountered when attempting to
progress from the image of a natural scene to its description.

A rule-based image interpretation system is presented
which is shown to be effective in interpreting complex
outdoor scenes. The system utilizes world knowledge to
reduce the ambiguities in image measurements obtained
from simple interpretation rules. These rules involve
sets of partially redundant features each of which
defines an area of feature space which represents a
"vote" for an object. The features include color,
texture,  shape,  size,  image  location,  and  relative 
location  to  other  objects.  Convergent  evidence  from 
multiple  interpretation  strategies  is  organized  by 
top-down  control  mechanisms  in  the  context  of  a  partial 
interpretation.  One  such  strategy  extends  a  kernel 
interpretation  derived  through  the  selection  of  object 
exemplars  and  regions  which  represent  the  most  reliable 
image  specific  hypotheses  of  a  general  object  class. 
The  use  of  exemplar  strategies  and  other  top-down 
strategies  results  in  the  extension  of  partial 
interpretations  from  islands  of  reliability. 

1.  Introduction 


The use of world knowledge, together with top-down
control, is beneficial and probably essential in
domains where uncertain data and intermediate results
containing errors cannot be avoided. The task of
"understanding" images of unconstrained natural scenes
is such a domain. Ambiguity and uncertainty in image
interpretation tasks arise from many sources, including
the inherent variation of objects in the natural world
(e.g., the size, shape, color and texture of trees), the
ambiguities arising from the perspective projection of
the 3D world onto a 2D image plane, occlusion,
changes in lighting, changes in season, image artifacts
introduced by the digitization process, etc. These
difficulties are compounded by the lack of precise
mechanisms or theories of visual processes that would
allow the accurate and reliable extraction of important
image events. Nevertheless, many real scenes contain
numerous instances in which human
observers can infer the presence and location of objects
from marginal bottom-up information.

1. This research was supported in part by the Defense Advanced
Research Projects Agency under contract number
N00014-82-K-046, the National Science Foundation under grant
number MCS-7918209, and by the Air Force Office of
Scientific Research.

Attempts  in  computer  vision  research  to  contend 
with  these  issues  involve  the  integration  of  the  results 
of  analysis  of  different  aspects  of  the  visual  data  (such 
as  color,  texture,  shape,  perspective,  stereopsis,  motion, 
etc.),  and  from  overlapping  local  contextual  and 
environmental  constraints.  We  will  present  a  system 
here  that  uses  a  large  amount  of  stored  knowledge  to 
carry  out  the  image  interpretation  task.  One  can  only 
expect  these  results  to  be  improved  by  more  effective 
low-level processes than those presented in this system.

Early systems demonstrated that task-specific
knowledge could be used to advantage in image
interpretation (for example [9, 10, and 11]). Recent
research has been directed towards developing
increasingly general representations of the world
knowledge needed for image understanding. For
example, generalized cylinders, a fairly robust
representation of form, are used in the system
developed by Brooks [2]. In this representation, the
metric relations among objects and within object
descriptions are parameterized. The image
interpretation process builds a graph representation of
the particular image, and also fixes bounds on those
free parameters needed for interpretation. However,
the object representation and the matching process do
not make use of any image features other than ribbons,
a linked set of edges that are candidates for projections
of generalized cylinders.

The systems developed by Ohta [7] for
understanding images of buildings in outdoor settings
and by Nagao [6] for understanding aerial photographs
make much wider use of the image data. In both
these systems the objects are described in terms of
their possible image appearance; these descriptions
include both spectral and spatial features. Both systems
have a very rich description of appearance but little
description of form. The system developed by Nagao
is of special interest here, because of his use of
"characteristic regions." By identifying those regions that
can be given a tentative identity (as, say, "a large
textured region") with a high degree of certainty and
by subsequently associating a label with those regions
("forest"), the identification process can build up
"islands of certainty" that yield information about the
appearance of specific objects in the image. This is
similar to the concept of object exemplars described in
Section 4. A review of these and other related works
appears in [1].

In this paper the interpretation task examined is
that of labelling an initial region segmentation of an
image with object (and object part) labels, when the
image is known to be a member of a restricted class
of scenes (e.g., suburban house scenes). An important
aspect of this task is the effective use of scene/image
knowledge in the interpretation process, particularly
methods and techniques for aggregating and mapping
preliminary region, boundary, and/or surface data into
more abstract descriptions. The results discussed in
Section 5 were obtained from a version of the
VISIONS system configured with a region segmentation
system, a knowledge network such as the one shown in
Figure 1, a collection of interpretation "rules", and a
set of interpretation "strategies".


2.  A  Knowledge  Network  and  Representation  Using 
Schemata 

Descriptions of scenes, at various levels of detail,
are captured in a set of schema hierarchies [4]. A
schema graph is a data structure defining an expected
collection of objects, such as a house scene, the
expected visual attributes associated with the objects in
the schema (each of which can have an associated
schema), and the expected relations among them. For
example, a house (in a house scene hierarchy) has roof
and house wall as sub-parts, and the house wall has
windows, shutters, and doors as sub-parts. The
knowledge network of Figure 1 is a simplified
version of a schema hierarchy as developed in [8]. Each
schema node (e.g., house, house wall, and roof) has
both a structural description appropriate to the level of
detail and methods of access to a set of recognition
and verification strategies called interpretation strategies.
For example, the sky-object schema (associated with the
outdoor-scene schema) has access to the exemplar
selection and extension strategy discussed below.
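A schema hierarchy of this kind can be pictured as a small tree of nodes, each carrying attributes, sub-parts, and the strategies it can invoke. The sketch below is only an illustration: the class `Schema`, its field names, and the placement of the house under a single outdoor-scene node are simplifying assumptions, not the VISIONS data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    name: str
    attributes: dict = field(default_factory=dict)   # expected visual attributes
    parts: list = field(default_factory=list)        # sub-part schemas
    strategies: list = field(default_factory=list)   # interpretation strategies by name

    def find(self, name):
        """Depth-first search for a schema node by name."""
        if self.name == name:
            return self
        for part in self.parts:
            hit = part.find(name)
            if hit is not None:
                return hit
        return None


house_wall = Schema("house wall", parts=[Schema("window"), Schema("shutter"), Schema("door")])
house = Schema("house", parts=[Schema("roof"), house_wall])
sky = Schema("sky", strategies=["exemplar selection and extension"])
outdoor_scene = Schema("outdoor scene", parts=[sky, house])
```

Looking up `outdoor_scene.find("shutter")` walks from the scene through house and house wall, mirroring the part/sub-part relations described in the text.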

In general, the information available about any
scene component falls into one of three classes:
knowledge of form, of spectral characteristics, and of
plausible relations with other objects. Interpretation
rules relate image events to knowledge events by
providing evidence for or against part/sub-part
hypotheses. An interpretation strategy, associated with
a schema node, specifies how specific interpretation
rules may be applied, and how combined results from
multiple rules may be used to decide whether or not to
"accept" (i.e., instantiate) an object hypothesis. The
interpretation strategy thus represents both control local
to the node and top-down control over the instantiation
process.

Note  that  the  goal  is  not  to  have  these 
interpretation  rules  and  strategies  extract  exactly  the 
correct  set  of  regions.  Our  philosophy  is  to  allow 
incorrect,  but  reasonable,  hypotheses  to  be  made  and  to 
bring  to  bear  other  knowledge  (such  as  various 
similarity  measures  and  spatial  constraints)  to  filter  the 
incorrect  hypotheses.  An  example  of  such  error 
detection  and  correction  in  the  interpretation  process 
will  be  given  in  Section  5. 

3.  Rule  Form  for  Object  Hypotheses  Under  Uncertainty 

Important  schema  attributes  of  objects  include 
features  such  as  color,  texture,  shape,  relative  size 
measurements,  and  expected  spatial  relationships  with 
other  objects,  object  parts,  and  the  image  frame. 
Unfortunately,  the  large  variations  observed  in  image 
features  and  the  significant  overlap  of  feature 
distributions  across  images  preclude  the  use  of  standard 
pattern  classification  approaches  to  the  problem  of 
characterizing  a  set  of  measured  features  by  an  object 
label.  This  approach  produces  a  large  number  of 
false-positive  responses  as  indicated  in  Figure  2  for  an 
"excess green" feature (2G-R-B).

We propose an approach to object hypothesis
formation which is both simple and effective. It relies
on convergent evidence from a variety of measurements
and expectations. For example, in an outdoor scene
taken with a camera in standard position, one would
expect grass to be of medium brightness, to have a
significant green component, to be located somewhere
in the lower portion of the image, etc. (see footnote 2). These
expectations can be translated into a strategy which
combines the results of many measurements into a
confidence level that the region (or meta-region)
represents grass.

We will illustrate the form of a simple
interpretation rule based on using the expectation that
grass is green. The feature used is average excess
green for the region, obtained by computing the mean
of 2G-R-B for all pixels in this region. Histograms of
this feature are shown in Figure 2, comparing all
regions to all known grass regions across 8 samples of
color outdoor scenes. An abstract version is shown in
Figure 3. The basic idea is to form a mapping from a
measured value of the feature obtained from an image
region, say f_1, into a "vote" for the object on the basis
of this single feature. One approach to defining this
mapping is based on the notion of prototype vectors
and the distance from a given measurement to the
prototype, a well-known technique which extends to
N-dimensional feature space [4]. In our case, rather
than using this distance to "classify", we translate it
into a "vote".

2. Note that a camera model and access to a
3D representation of the environment could dynamically
modify the value of these location limits in the image;
thus, the system would modify expectations as it orients
the camera up or down relative to the ground plane.
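The excess-green feature described above is a direct computation. A minimal sketch follows, assuming a region is available as a collection of (R, G, B) pixel tuples; the function name and that input format are assumptions made for the example.

```python
def mean_excess_green(pixels):
    """Average of 2G - R - B over an iterable of (r, g, b) pixel tuples."""
    total, count = 0, 0
    for r, g, b in pixels:
        total += 2 * g - r - b
        count += 1
    if count == 0:
        raise ValueError("region has no pixels")
    return total / count


# Hypothetical two-pixel region with a noticeable green component.
grass_like = mean_excess_green([(10, 50, 20), (30, 60, 10)])   # 75.0
gray = mean_excess_green([(40, 40, 40)])                       # 0.0
```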


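Let d(f_p, f_1) be the distance between the
prototype feature point f_p and the measured feature
value f_1. The response R of the rule is then

R(f_1) = \begin{cases}
  1 & \text{if } d(f_p, f_1) \le \theta_1 \\
  \text{[illegible in source]} & \text{if } \theta_1 < d(f_p, f_1) \le \theta_2 \\
  0 & \text{if } \theta_2 < d(f_p, f_1) \le \theta_3 \\
  -\infty & \text{if } \theta_3 < d(f_p, f_1)
\end{cases}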

The thresholds θ1, θ2, and θ3 represent a gross
interpretation of the distance measurements. θ3 allows
strong negative votes if the measured feature value
implies that the hypothesized object cannot be correct.
For example, fairly negative values of the excess green
feature imply a color which should veto the grass label.
Thus, certain measurements can exclude object labels;
this proves to be a very effective mechanism for
filtering many spurious weak responses. Of course there
is the danger of excluding the proper label due to a
single feature value, even in the face of strong support
from many other features. In the actual implementation
of this rule form, θ1, θ2, and θ3 are replaced with six
values so that non-symmetric rules may be defined as
shown in Figure 4. There are many ways to combine
the individual feature responses into a score; here we
have used a simple weighted average.
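The rule form and the weighted-average combination can be sketched as follows. This is an illustration under stated assumptions: the symmetric three-threshold form is used rather than the six-value implementation, and the response between the first and second thresholds is taken to be a linear ramp from 1 to 0, which is an assumption of this sketch because that case is not legible in the text. The function names are hypothetical.

```python
VETO = float("-inf")


def rule_response(distance, theta1, theta2, theta3):
    """Map a distance from the prototype to a vote in [0, 1], or to a veto."""
    if distance <= theta1:
        return 1.0
    if distance <= theta2:
        # Assumed linear ramp from 1 down to 0 between theta1 and theta2.
        return (theta2 - distance) / (theta2 - theta1)
    if distance <= theta3:
        return 0.0
    return VETO


def combine(responses, weights):
    """Weighted average of rule responses; a single veto excludes the label."""
    if any(r == VETO for r in responses):
        return VETO
    return sum(w * r for w, r in zip(weights, responses)) / sum(weights)


# Hypothetical example: two features vote for "grass", color weighted double.
score = combine([rule_response(0.5, 1, 3, 6), rule_response(2, 1, 3, 6)], [2, 1])
```

Handling the veto explicitly, rather than letting minus infinity flow through the average, avoids an undefined result when a vetoing feature happens to carry zero weight.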


4.  Exemplars  and  Islands  of  Reliability 

The  extreme  variations  that  occur  across  images 
can  be  compensated  for  somewhat  by  utilizing  an 
adaptive  strategy.  This  approach  is  based  on  the 
observation  that  the  variation  in  the  appearance  of 
objects  (region  feature  measures  across  images)  is  much 
greater  than  object  variations  within  an  image  (see 
Figure  2). 


In the initial stages, there are few if any image
hypotheses, and development of a partial interpretation
must rely primarily on general knowledge of expected
object characteristics in the image and not on the
relationship to other hypotheses. The most reliable
object hypotheses can be formed using interpretation
rules based on prototype matching, and this can be the
basis of adaptation. A largely incomplete kernel
interpretation is formed based on the most reliable of
these hypotheses; this forms the initial context for
further interpretation strategies. One such strategy
extends the kernel interpretation by using the features
of labelled regions (color, texture, shape, location, and
size) as "exemplars" (new prototypes) which can be
used to select and label other regions of the same
identity. This is similar to the method in [6], where
"characteristic regions" were used to guide hypothesis
formation in the early stages of interpretation. Finally, a
verification phase can be applied where relations
between object hypotheses are examined for consistency.
Thus, the interpretation is extended through matching
and processing of region characteristics as well as
semantic inference.

Exemplar hypothesis regions are selected by a
rule of the general form described in Section 3. The
goal is to find a representative region that matches as
closely as possible the predefined template. Once
found, this region (or set of regions) can be used to
define an image-specific template (perhaps in a
different sub-space of the feature space than was used
to select it). Exemplar hypotheses differ from general
hypothesis rules in that they are more conservative;
they should minimize the number of false hypotheses, at
the risk of missing true target regions, by narrowing their
range of acceptable responses. If all regions are
vetoed, secondary strategies are invoked; for example,
the veto ranges can be relaxed, admitting less reliable
exemplars. Figure 5 compares the results of the grass
exemplar rule with the general grass hypothesis rule.
The strategy can also be used to generate lists of
hypotheses ordered by reliability.

The advantages of using object exemplars include:

1) an effective means for extending reliable
hypotheses to regions which are more ambiguous;
this is similar to the notion of "islands of
reliability" [3];

2) a knowledge-directed technique for partially
dealing with the unavoidable region fragmentation
that occurs with any segmentation algorithm or
low-level image transformation/grouping; regions
that are "similar" to the exemplar can be both
labelled and merged; similarity criteria can be
context-sensitive so that regions will be compared
to the exemplar in terms of the range of each
feature of that object;

3) exemplars play a natural role in the
implementation of an hypothesize-and-verify
control strategy; hypotheses are formed based
upon initial feature information and subsequently
can be used in a verification process where the
relationship between labelled regions provides
consistency checks on the hypotheses and the
evolving interpretation.

Let  us  briefly  consider  a  few  of  the  many  ways 
that  exemplars  can  be  used  to  extend  object  hypotheses. 
The  similarity  of  region  color  and  texture  can  be  used 
to  extend  an  object  label  to  other  regions,  possibly 
under  spatial  constraints.  Thus,  a  sky  exemplar  region 
would  be  restricted  to  comparisons  with  regions  above 
the  horizon  which  look  similar  to  the  largest,  bluest 
region  located  near  the  top  of  the  picture.  A  house 
wall  showing  through  foliage  can  be  matched  to  the 
unoccluded  visible  portion  based  upon  color  similarity 
and spatial constraints derived from inferences from
house  wall  geometry. 

The  shape  and/or  size  of  a  region  can  be  used  to 
detect  other  instances  of  multiple  objects,  as  in  the 
case  of  finding  one  shutter  or  window  of  a  house,  or 
one  tire  of  a  car,  or  one  car  on  a  road.  In  many 
situations,  multiple  instances  of  an  object  can  be 
expected  to  have  a  similar  size  and  shape.  This, 
together  with  constraints  on  the  image  location,  permits 
reliable  hypotheses  to  be  formed  even  with  high 
degrees  of  partial  occlusion.  If  one  is  viewing  a  house 
from  a  viewpoint  approximately  perpendicular  to  the 
front  wall,  other  shutters  can  be  found  via  the  presence 
of  a  single  shutter  since  there  are  also  strong  spatial 
constraints  on  their  location.  If  two  shutters  are  found 
then  perspective  distortion  can  be  taken  into  account 
when  looking  for  the  other  shutters,  even  without  a 
camera  model,  under  an  assumption  that  the  tops  and 
bottoms  of  the  set  of  shutters  lie  on  a  straight  line  [5]. 

5.  Results  of  Rule  Based  Image  Interpretation 

Experiments  are  being  conducted  on  a  set  of 
fifteen "house scene" images. Thus far, we have been
able  to  extract  sky,  grass,  and  foliage  (trees  and 
bushes)  from  nine  house  images  with  reasonable 
effectiveness,  and  have  been  successful  in  identifying 
houses  and  their  parts,  including  shutters  (or  windows), 
house  wall  and  roof  in  three  of  these  images.  The 
interpretation  strategies  use  many  redundant  features, 
each  of  which  can  very  often  be  expected  to  be 
present.  The  premise  is  that  many  redundant  features 
allow  any  single  feature  to  be  unreliable.  The  features 
utilized  vary  across  color  and  texture  attributes,  shape, 
size,  location  in  the  image,  relative  location  to 
identified  objects,  and  similarity  in  color  and  texture  to 
identified  objects.  Object  hypothesis  rules  were 
employed  as  described  in  previous  sections,  and 
additional object verification rules requiring consistent
relationships with other object labels are being
developed.  The  final  results  shown  in  Figure  6  are  an 
interpretation  based  on  coarse  segmentations.  Further 
work  on  segmentation  (Figure  7)  is  being  carried  out, 
as  is  the  refinement  of  the  exemplar  selection  and 
matching  rules  (that  were  shown  in  section  3). 

An  extremely  important  capability  for  an 
interpretation  system  is  feedback  to  lower  level 
processes  for  a  variety  of  purposes.  The  interpretation 
processes  should  have  focus-of-attention  mechanisms  for 
correction  of  segmentation  errors,  extraction  of  finer 
image  detail,  and  verification  of  semantic  hypotheses. 
An  example  of  the  effectiveness  of  semantically 
directed  feedback  to  segmentation  processes  is  shown  in 
Figure  8.  Two  different  segmentations  are  shown;  the 
second,  with  less  image  detail,  was  used  here.  There  is 
a  key  missing  boundary  between  the  house  wall  and 
sky  which  leads  to  incorrect  object  hypotheses  based 
upon  local  interpretation  strategies.  The  region  is 
hypothesized  to  be  sky  by  the  sky  strategy,  while 
application  of  the  house  wall  strategy  (using  the  roof 
and  shutters  as  spatial  constraints  on  the  location  of 
house  wall)  leads  to  a  wall  hypothesis. 

There  is  evidence  available  that  some  form  of 
error has occurred in this example: 1) conflicting
labels are produced for the same region by local
interpretation  strategies;  2)  the  house  wall  label  is 
associated  with  regions  above  the  roof  (while  there  are 
houses  with  a  wall  above  a  lower  roof,  the  geometric 
consistency  of  the  object  shape  is  not  satisfied  in  this 
example);  and  3)  the  sky  extends  down  close  to  the 
approximate  horizon  line  in  only  a  portion  of  the 
image  (which  is  possible  but  worthy  of  closer 
inspection). 

In  this  case  resegmentation  of  the  sky-housewall 
region  with  segmentation  parameters  set  to  extract  finer 
detail  produces  the  results  shown  in  Figure  8a. 
Subsequent  remerging  of  similar  regions  produces  a 
usable  segmentation  of  this  region  as  shown  in  8b.  It 
should  be  pointed  out  that  in  this  image  there  is  a 
discernible boundary between the sky and house wall.
Initially,  the  segmentation  parameters  may  be  set  so 
that  the  initial  segmentation  misses  this  boundary.  This 
may  occur  because  of  computational  requirements  (fast, 
coarse segmentations) or as an explicit control strategy.
However, once it is resegmented with an intent of
overfragmentation,  this  boundary  can  be  detected. 
Remerging based on region means and variances of a
set  of  features  allows  much  of  the  overfragmentation  to 
be  removed.  Now,  the  same  interpretation  strategy 
used earlier produces quite acceptable results shown in
Figure  9. 
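
The remerging step, which groups overfragmented regions on the basis of region means and variances of a set of features, might be sketched as below. This is a minimal sketch under assumptions, not the method used in the paper: the similarity test (difference of means compared with a pooled standard deviation), the threshold `k`, and the region and adjacency data structures are all hypothetical.

```python
import math


def similar(a, b, features, k=1.0):
    """True if, for every feature, the means differ by at most k pooled std devs.

    Each region is a dict mapping a feature name to a (mean, variance) pair.
    """
    for f in features:
        (mean_a, var_a), (mean_b, var_b) = a[f], b[f]
        pooled_std = math.sqrt((var_a + var_b) / 2.0)
        if abs(mean_a - mean_b) > k * pooled_std:
            return False
    return True


def remerge(regions, adjacency, features, k=1.0):
    """Group adjacent, similar regions; returns a list of sets of region ids.

    Note: statistics are not recomputed after each merge in this sketch.
    """
    parent = {rid: rid for rid in regions}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in adjacency:
        if similar(regions[a], regions[b], features, k):
            parent[find(a)] = find(b)

    groups = {}
    for rid in regions:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(groups.values(), key=min)
```

Only adjacent regions are compared, so two regions with similar statistics in different parts of the image are not merged unless a chain of similar neighbors connects them.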

The  current  development  of  interpretation 
strategies  involves  the  utilization  of  stored  knowledge 
and a partial model (labelled regions) for hypothesis
extension. In these strategies the knowledge network is
examined for objects that can be inferred from
identified  objects,  and  for  relations  that  would 
differentiate  them.  For  example,  the  bush  regions  can 
be  differentiated  from  other  foliage  based  on  their 
spatial  relations  to  the  house,  and  front  and  side  house 
walls  can  be  differentiated  using  geometric  knowledge 
of  house  structure  (e.g.,  relations  between  roof  and 
walls),  as  shown  in  Figure  10.  In  the  full  system,  these 
rules  would  not  work  in  isolation  as  shown  here,  and 
the  errors  made  by  this  type  of  rule  would  be  filtered 
by  other  constraints. 

Future  work  is  directed  towards  refinement  of  the 
segmentation  algorithms,  object  hypothesis  rules,  object 
verification  rules,  and  interpretation  strategies.  System 
development  is  aimed  towards  more  robust  methods  of 
control:  automatic  schema  and  strategy  selection, 
interpretation  of  images  under  more  than  one  general 
class  of  schema,  and  automatic  focus  of  attention 
mechanisms  and  error-correcting  strategies  for  resolving 
interpretation  errors. 
Figure 1. Abstract representation of a portion of the
knowledge  network  used  to  produce  the  interpretation 
results.  Knowledge  about  a  class  of  scenes  is 
represented by a hierarchy of schemata embedded in a
network.  Structural  descriptions  of  each  schema  are 
organized  by  the  component  descriptions  of  subclass 
and  subpart,  and  by  spatial  relations.  Each  schema 
node has access to a set of interpretation rules which
form hypotheses on the basis of image measurements,
and  interpretation  strategies  which  describe  how  these 
hypotheses  are  combined  with  information  in  the 
network  to  form  a  consistent  interpretation.  Although 
every  node  in  the  network  is  considered  to  be  a 
schema,  only  selected  nodes  in  the  figure  show  the  full 
structure. 


Figure 2. Image histograms of an "excess green" feature
(2G-R-B)  computed  across  eight  sample  images.  The 
unshaded  histogram  represents  the  global  distribution  of 
the  feature.  The  darkest  cross  hatched  histogram  is  the 
distribution  of  this  feature  across  regions  known  to  be 
grass  (from  a  hand  labeling  of  the  images)  in  one  of 
three specific images. The intermediate cross hatching
represents  all  known  grass  regions  across  the  entire 
sample.  Note  the  shifting  (with  respect  to  the  full 
histogram)  of  the  histograms  for  the  individual  images. 
Figure 3. Structure of a simple rule for mapping an
image  feature  measurement  fj  into  support  for  a  label 
hypothesis  on  the  basis  of  a  prototype  feature  value 
obtained  from  the  combined  histograms  of  labeled 
regions  across  image  samples.  The  object  specific 
mapping is parameterized by four values, f_p, θ_1, θ_2,
θ_3, and stored in the knowledge network. The use of
six  values  will  allow  an  asymmetric  response  function. 


Figure  5.  The  exemplar  hypothesis  rule  is  more 
selective  than  the  corresponding  general  interpretation 
rule  (based  on  a  less  selective  rule  form).  Figure  5a 
shows  the  general  grass  interpretation  rule,  while  Figure 
5b  shows  the  exemplar  rule.  Note  that  the  general 
form  of  the  rule  results  in  more  incorrect  region 
hypotheses (which could be filtered by constraints from
the knowledge network). Although the exemplar rule
misses  some  grass  regions,  those  found  have  high 
confidence. 


Figure  4.  An  example  grass  rule,  showing  an 
asymmetrical  structure,  superimposed  on  the  histogram 
of  Figure  2c. 
Figure 6. Example interpretations for three of the
house  scene  images.  The  labeling  is: 


SKY 


Figure  7.  The  Segmentation  process  can  be  varied  to  produce  finer  distinctions  in  image  structure 
at  the  expense  of  a  larger  number  of  regions  and  subsequent  fragmentation  of  many  large  regions. 
Figure 7a (coarse detail segmentation) exhibits a missing roof/sky boundary; the finer detail
segmentation  (Figure  7b)  has  this  boundary,  although  it  was  formed  at  the  expense  of 
significantly  more  regions. 


Figure  8.  Resegmentation  of  house/sky  region  from  Figure  7a.  Figure  8a  is  the  original 
segmentation showing the region to be resegmented; 8b shows the regions resulting from the
selective application of the segmentation process to the cross-hatched area in 8a.


Figure  9.  Final  interpretation  of  the  house  scene  in 
Figure 6c, after inserting resegmented house/sky regions
and reinterpreting the image.


Figure  10.  An  example  of  the  use  of  spatial  relations  to  filter  and  extend  region  labeling.  The 
geometric relations between house and shrub (in 10a) and between roof and house front
wall  (in  10b)  are  used  to  refine  region  hypotheses  from  the  interpretation  shown  in  Figure  6c. 
Note  that  there  are  still  ambiguities  (the  shrub  label  in  the  grass  area,  and  the  pants  labeled  as 
house  wall)  that  require  the  use  of  other  filters. 
Abstract

Evidence is presented showing that bottom-up grouping of
image features is usually prerequisite to the recognition and
interpretation of images. We describe three functions of
these groupings: 1) segmentation, 2) three-dimensional
interpretation, and 3) stable descriptions for accessing object
models. Several unifying principles are hypothesized for
determining which image relations should be formed:
relations are significant to the extent that they are unlikely
to have arisen by accident from the surrounding distribution
of features, relations can only be formed where there
are few alternatives within the same proximity, and
relations must be based on properties which are invariant over
a range of imaging conditions. Using these principles we
develop an algorithm for curve segmentation which detects
significant structure at multiple resolutions, including the
linking of segments on the basis of curvilinearity. The
algorithm is able to detect structures which no single-resolution
algorithm could detect. Its performance is demonstrated on
synthetic and natural image data.

Introduction 

A major goal of computer vision research is to relate visual
images to prior knowledge of their constituents and thereby
label and interpret them. However, current model-based
vision systems have been demonstrated only in
tightly-constrained environments with a few well-specified models
to compare to the image [2, 9, 12]. The difficulty in
expanding performance to more general domains is not one of
ambiguity: it is very unlikely that two different models will
fully fit the same image data. Rather, the problem is one
of searching for potential correspondences between models
and the image, since increasing the number and generality
of the models results in an excessively large space of
possible matches. Continued research into recovering
three-dimensional shape from images (using stereo, motion,
shading, and texture) promises to reduce the size of this search
space considerably. However, the problem of matching is
far from solved even when given full three-dimensional
information, and these methods fail to explain the excellent
level of human performance in such simple domains as line
drawings.


In order to interpret images about which we have little
prior knowledge, it is necessary to use effective bottom-up
techniques to structure and describe the image in a form
that can be used to selectively index into a large body of
world knowledge. In this paper we will describe methods
for detecting and evaluating the significance of relations
between image elements in a way that can be applied uniformly
to all images before we have any knowledge of their
contents. Previous research on this and related topics has gone
under such names as image segmentation, figure/ground
phenomena, texture description, perceptual organization,
and Gestalt perception. There have been many efforts to
develop algorithms for specific segmentation problems, such
as the detection of collinearity or connectivity, but these
have not been integrated and have often lacked general
applicability. Marr's initial primal sketch formulation [6] was
intended to make some of these relations explicit. Recently,
Witkin and Tenenbaum [13] have argued for the importance
of detecting regularities and imposing structure on
the image for many of the same reasons given here. They
describe a unified treatment of inference based on the
assumption that regularities detected in the image are
nonaccidental. In this paper we will describe the role that this
form of inference plays in model-based recognition, develop
some underlying principles for this level of interpretation,
and present new segmentation methods based upon these
principles.

There are three valuable sources of information which
the  bottom-up  organization  of  image  features  can  provide, 
all  of  which  simplify  the  problem  of  matching  against  world 
knowledge: 

1)  A  major  reduction  in  the  search  space  is  achieved  by 
segmentation — the  division  of  the  image  into  sets  of 
related  features.  This  has  long  been  recognized  as  a 
crucial  problem  in  image  interpretation.  We  do  not 
want  to  match  models  against  all  possible  combinations 
of  features  in  an  image,  so  good  segmentation  is  crucial 
for  reducing  the  combinatorics  of  this  search. 

2) Two-dimensional relations lead to specific three-dimensional
interpretations, as we have described in previous
papers [1, 5]. For example, collinear lines in the image
are likely to be collinear in 3-space. A corollary of
this is that these image relations tend to be invariant
with respect to viewpoint, which greatly simplifies the
problem of matching to three-dimensional objects of
unknown orientation.

3) To the extent that these relations are stable under
different imaging conditions and viewpoints, they can
be used as index terms to access a body of world
knowledge. Not only can the names of the relations
be used, but in addition each relation will have several
parameters of variation whose relative values in the
image can be used. For example, collinear line segments
can be characterized by the relative sizes of the
segments and gaps, which provides a viewpoint-invariant
description that can be used to select a model for
attempted matching.

Note  that  all  three  of  these  points  assume  that  the  relations 
found in the image are a result of regularities in the objects
being viewed, which means that any relations which
happen  to  arise  accidentally  from  independent  features  will 
only  confuse  the  interpretations.  This  distinction  between 
significant  and  accidental  relations  is  a  point  to  which  we 
will  return. 

The  importance  of  segmentation 
for  recognition:  An  experiment 
The importance of these grouping operations as a stage in
the processing of images by the human visual system can
be demonstrated by a straightforward psychophysical
experiment. In Figure 1(a) we have constructed a partial line
drawing of a bicycle in such a way that most opportunities
for bottom-up segmentation are eliminated (e.g., we have
eliminated most cases of significant collinearity, endpoint
proximity, parallelism, and symmetry). In informal
experiments with 10 subjects who were told nothing about the
identity of the object, this drawing proved to be
remarkably difficult to recognize. Nine out of 10 subjects were
unable to recognize the object within a 60 second time limit,
and the tenth subject took 15 seconds. Note that this is in
spite of the fact that the object level segmentation has
already been performed; the task would be even harder if the
bicycle were embedded in a normal scene containing many
surrounding features.

Figure 1(b) is the same drawing as in 1(a) with only
a single segment added. The added segment was placed in
a strategic location which would allow it to be combined
with other segments in a curvilinear grouping. The center
of this circular grouping would then be coincident with the
termination of another segment, leading to further
groupings. As might be expected if we assume that bottom-up
groupings play an important role in recognition, the
recognition times for this second figure were dramatically lower
than for the first, with 3 out of 10 subjects recognizing it
within 5 seconds and with 7 out of 10 subjects recognizing it
within the 60 second time limit. Presumably, if the added
segment had been placed at some location which did not
lend itself to perceptual grouping, the change in recognition
times would have been negligible.

These  figures  can  also  be  used  to  demonstrate  the 


Figure  1:  When  opportunities  for  bottom-up  grouping  of  image 
features  have  been  removed,  as  was  done  for  the  line  drawing  of 
a  bicycle  in  (a),  the  drawing  is  remarkably  difficult  to  recognize. 
The  average  recognition  time  for  (a)  was  over  one  minute  when 
the  subjects  had  no  prior  knowledge  of  the  object’s  identity. 
When  a  single  line  segment  was  added  in  (b),  which  provided 
local  evidence  for  a  curvilinear  grouping,  the  recognition  times 
were  greatly  reduced. 


human capability to make use of top-down contextual
information to limit the search space for forming a match.
As was demonstrated in experiments performed as early as
1935 [3], verbal clues naming the object in an image or
even naming vague non-visual object classes can drastically
reduce the recognition time. Subjects can usually interpret
Figure 1(a) immediately upon being told that it is a bicycle.
Thus this figure is on an interesting borderline where either
bottom-up or top-down information can suddenly reduce
the search space and lead to recognition. One can imagine
a series of experiments that would systematically explore
this search space and the reduction in its size created by
different bottom-up or top-down clues. These figures can
also be used to demonstrate the human equivalent of a
back-projection algorithm [4] followed by image-level matching,
where certain hypothesized partial matches can be used to
solve for the position, orientation, and internal parameters
of the model, which in turn lead to predictions for further
matches at specific locations in the image.
Principles of segmentation
There are virtually an infinite number of relations that could
be formed between the elements of any image. What general
principles can we derive for selecting those relations which
are worth forming and for measuring their significance? As
was mentioned earlier, segmentations are useful only to the
extent that they represent actual structure of the scene
rather than accidental alignments. Therefore, a central
function of the segmentation process must be to distinguish,
as accurately as possible, significant structures from those
which have arisen at random. All of the relations we have
considered can arise from accidents of viewpoint or random
positioning as well as from structure in the image. However,
by examining the accuracy of each relation and the
surrounding distribution of features in the image, it is possible
to give probabilistic measures of the likelihood that any
given relation is accidental. These nonrandomness measures
can then be used as the basic test for significance during the
segmentation process.

If there were a significant level of prior knowledge
regarding the expected distributions of features and
relations, this could be used for judging the significance of
segmentations. However, the range of common images seems
to be so wide that any prior knowledge at this level must be
very weak. We have chosen to carry out our computation
of significance with respect to the null hypothesis that
features are independent with respect to orientation, location,
and scale. Significance is then inversely proportional to the
probability that the relation would have arisen from such
a set of independent features. It is a matter for psychological
experimentation to see whether the human visual
system is biased in any direction from this independence
assumption. But since a scene typically contains many
independently positioned objects (leading to independence with
respect to orientation, location, and scale in the image), the
discrimination of relations with respect to this background
seems like a reasonable criterion for judging significance.

A second major principle of segmentation is that each
operation must have limited computational complexity. It is
obviously impossible to test all combinations of features in
an image, so the relations can only be formed over distances
that do not include too many false candidates of the
particular type being examined [6]. Figure 2 shows an example
in which a highly significant grouping of five equally-spaced
collinear dots is not apparent to human vision when there
are enough surrounding false targets. It would presumably
be useful for the purposes of interpretation and
recognition to detect such a statistically significant grouping, so
this failure must be attributable to a lack of computational
resources. This does not mean that groupings are
diameter-limited in any absolute sense, since grouping can be
attempted at many different scales; however, if there are more
than a few false candidates at some scale, then no groupings
can be formed at that scale of description.
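
The limit on false candidates can be illustrated with a small sketch. It is an assumption-laden illustration rather than the paper's procedure: the function name, the use of a fixed search `radius` to stand for a scale of description, and the cutoff `max_candidates` (a stand-in for "more than a few") are all hypothetical.

```python
import math


def candidate_partners(points, index, radius, max_candidates=3):
    """Indices of points within `radius` of points[index].

    Returns an empty list when there are too many competing candidates,
    in which case no grouping is attempted at this scale.
    """
    p = points[index]
    near = [
        i for i, q in enumerate(points)
        if i != index and math.dist(p, q) <= radius
    ]
    return near if len(near) <= max_candidates else []
```

Calling the function with several values of `radius` corresponds to attempting grouping at several scales; a point in a cluttered neighborhood may yield candidates only at small radii.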

The  principles  above  describe  which  groupings  will  be 
formed  and  how  they  will  be  evaluated  for  a  given  class 
of relations, but they do not specify which classes of relations
will be attempted. There are several factors which




Figure 2: The pattern of five equally-spaced collinear dots in (a)
is  not  detected  spontaneously  by  human  vision  if  it  is  surrounded 
by  enough  competing  candidates  for  grouping,  as  in  (b).  This 
occurs  even  though  the  relation  remains  highly  significant  in 
the  statistical  sense  and  would  therefore  likely  be  of  use  for 
recognition. 

influence this choice. One important factor is the same
imaging-invariance condition that was mentioned earlier:
it is only worth looking for image relations which do not
depend on a specific viewpoint, light-source position, or
other image-formation parameter. For example, collinearity
is useful because it is present in the image over all
viewpoints of collinearity in the scene. But it would be
pointless to detect lines at right angles in the image, since even
if right angles are common in the scene the angle in the
image would change with almost any change in viewpoint.
More generally, Witkin and Tenenbaum [13] argue that
prior probabilities play a role in selecting which relations
are the easiest to distinguish from accidentals, and should
therefore be attempted. If some relation arises only very
rarely from the structure of typical scenes, then it is more
likely that some instance of the relation in an image is
accidental (although it would still be possible to distinguish
the relation from accidentals given accurate-enough image
measurements). Of course, it is also less productive to spend
time searching for properties which seldom arise than for
those which are common.

An  algorithm  for  curve  segmentation 
A significant bottleneck in creating a computer program
which can perform these bottom-up perceptual processes on
natural images is the problem of creating appropriately
segmented edge descriptions. The best current edge operators
detect "edge points" which are then linked using
nearest-neighbor algorithms into lists of points. Although there
has been considerable research into the problem of fitting
smooth curves to these lists of points [10, 11, 12], almost
without exception these efforts have concentrated on a single
pre-selected resolution of segmentation and have attempted
merely to smooth out noise induced by the imaging process.
Although these smoothed results may appear reasonable to
the naive human eye, that is because the human visual
system can still perform the lower resolution groupings even
though they have not been detected and described by the
program. Figure 3 illustrates the problem, where the
segmentation in Figure 3(b) is adequate to recognize one
instance of collinearity, but other groupings are only apparent
when lower resolution structures are recognized as in Figure
3(c). It is not adequate to merely apply previous
single-resolution methods at multiple resolutions, since only some
structures at some resolutions will be significant and the
measurement of this significance is important for further
inference. Therefore, we have developed a new algorithm,
based on the principles of segmentation outlined earlier,
which measures the degree of nonrandom structure in
edge-point lists over a wide range of resolutions. In our
implementation, we examine all groupings which are either linear or
of constant curvature. These can be splined to represent
arbitrary smooth curves, although it is possible that human
vision includes the detection of more general primitive curve
groupings, such as spirals.

Measuring  the  significance  of  a  curve  segmentation 

The first task in developing a segmentation algorithm
is to determine how we will measure the significance of
each grouping. In this case, since the points were originally
linked on the basis of proximity, we must be careful not to
confuse nonrandomness in proximity with the measurement
of nonrandomness in linearity. For example, if we start by
looking at a set of only three points, we might measure the
significance of their linearity by measuring the distance of
one point from the line joining the other two. However,
this would confuse the effects of proximity with those of
linearity, since by being close to one of the other points the
third point would automatically be close to the line on which
they lie, as is shown in Figures 4(a) and 4(b). Therefore, we
have chosen to define nonrandomness in linearity to be how
unlikely a point is to be as close as it is to a curve given
its distance from the closest defining point of the curve.
This is equal to 2θ/π, where θ is the angle between the line
and the vector from the closest endpoint, which for points
close to the line is approximately equal to the distance from
the line divided by the distance from the closest endpoint.
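
The 2θ/π measure for three points can be written out directly. The sketch below is illustrative only; the function name and the representation of points as (x, y) tuples are assumptions. A smaller value means the observed closeness to the line is less likely to be accidental.

```python
import math


def linearity_likelihood(p, a, b):
    """Return 2*theta/pi for point p relative to the line through a and b.

    theta is the angle between the (undirected) line and the vector from
    the endpoint closest to p, so the result lies in [0, 1].
    """
    # Pick the endpoint closest to p.
    near, far = (a, b) if math.dist(p, a) <= math.dist(p, b) else (b, a)
    vx, vy = p[0] - near[0], p[1] - near[1]
    lx, ly = far[0] - near[0], far[1] - near[1]
    cross = abs(vx * ly - vy * lx)   # proportional to sin(theta)
    dot = abs(vx * lx + vy * ly)     # proportional to |cos(theta)|
    theta = math.atan2(cross, dot)   # in [0, pi/2]
    return 2.0 * theta / math.pi
```

For endpoints (0, 0) and (10, 0), a point at (1, 1) and a point at (5, 1) are both at distance 1 from the line, yet the first gives 0.5 while the second gives a much smaller value, which is the distinction drawn in Figure 4.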

This can be extended to 4 or more points by recursively
looking for the point which is farthest away from any of
the  points  considered  thus  far  and  calculating  the  likelihood 
for  that  point  in  terms  of  its  minimum  proximity  to  these 
previously  considered  points.  Since  these  likelihood  values 
are  independent,  they  can  be  multiplied  together  to  produce 
an  overall  value  for  the  curve. 
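
The measure described above might be sketched in Python as follows. This is only a minimal sketch for the straight-line case and not the Maclisp implementation: the function names are hypothetical, the line is defined by the two points with greatest separation, and the angle for each remaining point is taken as the arcsine of its distance from the line divided by its distance from the nearest previously considered point, which is one assumed reading of the recursive rule. It expects at least three points with distinct endpoints.

```python
import math
from itertools import combinations

def point_probability(p, a, b, considered):
    """Probability 2*theta/pi that p lies as close as it does to line ab,
    given its distance from the nearest previously considered point."""
    length = math.dist(a, b)
    # Perpendicular distance of p from the line through a and b.
    d_line = abs((b[0] - a[0]) * (p[1] - a[1])
                 - (b[1] - a[1]) * (p[0] - a[0])) / length
    d_near = min(math.dist(p, q) for q in considered)
    if d_near == 0.0:
        return 1.0  # assumption: a duplicated point carries no evidence
    theta = math.asin(min(1.0, d_line / d_near))
    return 2.0 * theta / math.pi

def line_nonrandomness(points):
    """Inverse of the product of the per-point probabilities."""
    # Define the line by the two points with greatest separation.
    a, b = max(combinations(points, 2), key=lambda pair: math.dist(*pair))
    considered = [a, b]
    remaining = [p for p in points if p is not a and p is not b]
    probability = 1.0
    while remaining:
        # Next point: the one farthest from every point considered so far.
        nxt = max(remaining,
                  key=lambda p: min(math.dist(p, q) for q in considered))
        probability *= point_probability(nxt, a, b, considered)
        considered.append(nxt)
        remaining.remove(nxt)
    return math.inf if probability == 0.0 else 1.0 / probability
```

With this sketch, a middle point lying far from both endpoints, as in Figure 4(b), scores much higher than one at the same distance from the line but close to an endpoint, as in Figure 4(a).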

The algorithm could be made symmetric with respect
to  the  set  of  points  by  using  some  sort  of  best-fit  curve 
rather  than  selecting  a  subset  of  points  to  define  the  curve. 
However,  for  the  results  displayed  in  this  paper  we  have 
adopted  the  computationally  expedient  but  less  accurate 
method  of  defining  the  line  by  the  two  points  with  greatest 
separation  and  adding  the  most  central  point  to  define  a 
circular arc. Note that this test will only be used to measure
significance and not the final positions of curve segments,

Figure 3: The data in (a) can be segmented at at least two
different  resolutions  of  description,  as  shown  in  (b)  and  (c).  One 
instance of collinearity can only be detected in segmentation (b)
while  the  other  instance  of  collinearity  and  the  parallelism  can 
only  be  detected  in  (c). 


Figure 4: The middle dots in (a) and (b) are both the same
distance from the lines joining the other two dots. Yet the three
dots in (b) are much more significant in terms of their collinearity
than those in (a), since the middle dot in (a) could be close to
the  line  merely  as  a  result  of  its  proximity  to  the  first  endpoint. 
Therefore,  we  measure  the  probability  of  a  point  being  within  a 
given  distance  from  a  line  in  terms  of  its  proximity  to  the  closest 
endpoint  defining  the  line,  as  shown  in  (c). 

and that small relative errors in ranking segmentations will
not be very important. Our nonrandomness values (inverse
of probability) range from a minimum of 1.0 for apparently
random points up to at least $10^6$ for many points placed
accurately along smooth curves.

Testing  all  possible  groupings 

Given  this  significance  test  for  sets  of  points,  we  want  to 
divide  the  initial  linked  list  of  points  into  segments  which 
lead  to  the  highest  significance  values.  Previous  methods 
of  curve  segmentation  have  usually  attempted  to  search  for 
corners  (tangent  discontinuities)  on  a  curve,  where  the  curve 
can  be  divided  into  different  segments.  Our  approach  is 
the dual: we look for segments of the curve which exhibit
significant nonrandomness, and tangent and curvature discontinuities
are assigned to the junctions between neighboring
segments. In contrast to the earlier approaches, our
method will fail to assign any segmentation where the curve
appears to wander randomly at all resolutions, and will assign
multiple segmentations where it exhibits different structure
at different resolutions.

It would clearly be too costly to test every possible segment
of the curve for nonrandomness. However, if we allow
a  reasonable  margin  of  error,  it  is  possible  to  cover  all  scales 
and  locations  with  a  relatively  small  number  of  groupings. 
We  have  chosen  to  examine  groupings  at  all  scales  differing 
by  factors  of  two,  from  groupings  of  only  three  adjacent 
points  up  to  groupings  the  size  of  the  full  length  of  the  curve 
(amounting to 6 scales for a curve of 100 points). At each
scale,  we  examine  groupings  at  all  locations  along  the  curve, 
with  adjacent  groupings  overlapping  by  50%.  This  means 
that  any  given  segment  of  the  curve  will  have  at  least  one 
grouping  attempted  which  covers  50%  of  its  length  but  does 
not extend outside its borders. In addition, while calculating
the nonrandomness measure, each segment is extended
to include points on both sides which are close enough to
the curve that their inclusion increases the nonrandomness
value.
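
The enumeration of candidate groupings can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the name `candidate_groupings` and the half-window step are not taken from the original implementation, groupings are index ranges into the linked point list, and the subsequent extension of each segment to neighbouring points is omitted.

```python
def candidate_groupings(n_points, min_size=3):
    """Yield (start, end) index ranges, end exclusive, at scales that
    differ by factors of two, with roughly 50% overlap between
    adjacent groupings at the same scale."""
    size = min_size
    while size <= n_points:
        # Half-window step; at the smallest scale (3 points) this is 1,
        # so the overlap there is slightly more than 50%.
        step = max(1, size // 2)
        start = 0
        while start + size <= n_points:
            yield (start, start + size)
            start += step
        size *= 2

# Distinct grouping sizes tried for a curve of 100 points.
scales = sorted({end - start for start, end in candidate_groupings(100)})
```

For a curve of 100 points the sizes are 3, 6, 12, 24, 48, and 96, that is, six scales. The number of groupings at each scale is roughly twice the curve length divided by the grouping size, so the total stays linear in the number of points.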

Many of the segments produced by this exhaustive testing
will not exhibit significant nonrandomness and others
will be qualitatively the same as larger segments of which
they are merely a subset. Therefore, a thinning procedure
is executed which steps through the different resolutions at
each location along the curve and selects only those segmentations
which are locally maximum in their significance
values. It is possible that there will be more than one local
maximum if the curve exhibits different structures at
different resolutions of grouping. There is also a threshold
at the 0.05 significance level, below which groupings are not
considered significant. Once again these choices are somewhat
expedient, and we are seeking a more fundamental
method of combining multiple resolutions.
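
One possible reading of the thinning step is sketched below in Python. The data layout, the name `thin_groupings`, and the rule that a grouping is compared only with overlapping groupings one scale above and one scale below are assumptions made for illustration. The 0.05 threshold is interpreted here as requiring a probability of at most 0.05, that is, a nonrandomness value of at least 20.

```python
def thin_groupings(groupings, alpha=0.05):
    """Keep groupings that pass the significance threshold and are local
    maxima with respect to scale.

    groupings: list of (scale_index, start, end, nonrandomness) tuples,
    where [start, end) is a range of point indices along the curve.
    """
    min_value = 1.0 / alpha  # nonrandomness is the inverse of probability
    kept = []
    for g in groupings:
        scale, start, end, value = g
        if value < min_value:
            continue  # not statistically significant
        # Overlapping groupings at the adjacent coarser and finer scales.
        neighbours = [h for h in groupings
                      if abs(h[0] - scale) == 1 and h[1] < end and start < h[2]]
        if all(value >= h[3] for h in neighbours):
            kept.append(g)
    return kept
```

Because the comparison is local in scale, a location can retain more than one grouping when the curve shows distinct structure at widely separated resolutions.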

The  algorithm  in  action 

This algorithm has been implemented in Maclisp on a KL-10
computer and tested on synthetic data as well as edges
derived from natural images. Figure 5(a) shows some hand-drawn
curves which exhibit different structures at different
resolutions, much as was shown in Figure 3. Figure 5(b)
gives the output of the curve segmentation algorithm when
given this data, and demonstrates the algorithm's ability to
detect significant structure at multiple resolutions, results
which no single-resolution algorithm could have produced.

Figure 6 shows the results of running the algorithm on
a small 30 by 45 pixel region of an aerial photograph of an
oil tank facility. The original digitized image is shown in
6(a). Figure 6(b) shows some linked edge data generated
from this image by an edge detection program written by
David Marimont [7], which detects edge points to subpixel
accuracy and links them into lists. Figure 6(c) shows all the
groupings at all resolutions, although the widely differing
significance values are not apparent. Figure 6(d) shows the
results after the thinning process which selects local maxima
with respect to resolution. Given these segments, it is relatively
easy to form collinearity and curvilinearity relations
between them as shown by the dotted lines in Figure 6(e).
It would also be fairly straightforward to detect endpoint
proximity, parallelism, constant intervals, and other perceptual
groupings.


Figure 5: The hand-input curves in (a) have been created to exhibit
significant structure at multiple resolutions. When these are
given as data to the curve segmentation algorithm, it produces
the results shown in (b), which makes these multiple levels of
structure explicit.

Figure 6: The small 30 by 45 pixel region of an aerial photograph
shown in (a) was run through the Marimont edge detector to
produce the linked edge points shown in (b). Figure (c) shows all
the segments at different scales and locations which were tested
for significance. After selecting only those segments which were
locally maximum with respect to size of grouping, and thresholding
out those which are not statistically significant, we are left
with the segments shown in (d). It is then fairly simple to form
collinearity and curvilinearity relations between these segments
as shown by dotted lines in (e).


Summary 

We started this paper by demonstrating the importance of
bottom-up perceptual organization for human vision. These
image relations play a major role in limiting the size of the
search space that must be considered when matching against
world knowledge. The unifying principles of detecting nonrandom
structure, avoiding combinatorial complexity, and
looking for viewpoint-invariant relations were suggested. An
algorithm for curve segmentation, based upon these principles,
was developed and demonstrated. These curve segmentations
enabled the use of a relatively simple algorithm
for  grouping  on  the  basis  of  curvilinearity,  and  extensions 
for  detecting  other  classes  of  groupings  seem  to  be  within 
reach.  There  are  many  other  problems  besides  recognition 
in  which  these  groupings  would  be  useful.  An  example  is 
the  stereo  correspondence  problem,  since  to  the  extent  that 
these  image  relations  represent  structure  in  the  scene  and 
are  invariant  with  respect  to  viewpoint,  they  can  be  expected 
to  remain  visible  in  images  taken  from  different  viewpoints. 
They  would  then  provide  far  less  ambiguous  structure  for 
matching  than  simple  edge  points. 

The specific algorithms developed are preliminary implementations
of the general methodology of segmenting
perceptual data by looking at groupings over a wide range of
scales and locations and retaining those which are the most
unlikely  to  have  arisen  by  accident  from  the  background 
distribution.  This  same  methodology  could  be  applied  to 
a  wide  range  of  other  perceptual  segmentation  problems  or 
signal  analysis. 

Abstract

In recent years, Binford's generalized cylinders have become an
important tool for image understanding. However, research has
been hampered by a lack of analytical results for these shapes. In
this paper, a definition is presented for Straight Homogeneous
Generalized Cylinders, those generalized cylinders with a straight
axis and with cross-sections which have constant shape but vary in
size. This class of shapes, while still quite large, has properties
which make considerable analysis possible.

The results begin with deriving formulae for points and surface
normals for these shapes. Theorems are presented concerning the
conditions under which multiple descriptions can exist for a single
solid shape. Then projections and contour generators are analyzed
for some subclasses of shapes. The strongest results are obtained
for solids of revolution (which we name Right Circular SHGCs), for
which a closed form method for analyzing image contours is
presented. It is seen that a picture of the contours of a solid of
revolution is ambiguous, with one degree of freedom related to the
angle between the line of sight and the solid's axis.


1.  Introduction 

In  recent  years,  the  generalized  cylinders  proposed  by  Binford 
[2]  have  become  a  popular  shape  representation  scheme  for  image 
understanding.  Unfortunately,  research  has  been  hampered  by  a 
lack  of  analytical  results  for  these  shapes.  This  paper  presents 
Straight Homogeneous Generalized Cylinders and performs
analysis  of  several  of  their  basic  properties. 

Straight  Homogeneous  Generalized  Cylinders  (SHGCs)  are 
defined  by  a  cross-section  shape  which  sweeps  along  a  straight 
line  axis,  changing  size  uniformly  as  it  is  swept.  This  class  of 
shapes, while still quite large in itself, has properties which allow
considerable analysis to be performed. Formulae are developed for
points on the surface of an SHGC and for surface normals of an
SHGC.

Many  researchers  in  the  past  have  approached  generalized 
cylinders by trying to specify the "canonical" description for a
given shape. We take a different approach, allowing a shape to
have many different descriptions as a generalized cylinder. This
solves the very difficult problem of trying to specify a single
description as "the best"; however, it introduces the problem of
deciding when two descriptions are in fact describing the same
shape. This "Equivalence Problem" is given some attention and
some limited (but still very useful) results are presented.


We then describe contours of tangency with the line of sight,
which will include "outlines" and visible "folds" in an image of an
SHGC. We examine in some detail the special case of Right
Circular SHGCs (solids of revolution), which have sufficiently strong
properties to allow very detailed contour analysis. Using these
shapes, we develop a contour analysis technique and make several
interesting observations about occlusion and singularities in
tangency contours. We show that an image of the contours of a
solid of revolution is ambiguous with one degree of freedom in the
interpretation.

2.  Straight  Homogeneous  Generalized  Cylinders 


Figure 1 shows a Straight Homogeneous Generalized Cylinder
(SHGC), as described in the taxonomy of [8]. An SHGC is a
function which maps two parameters onto a set of points in x-y-z
space (i.e. the world). The two parameters are s, which measures
distance along the axis, and t, which indirectly measures distance
along the cross-section; both s and t have as domain the unit
interval [0,1]. This development is similar to that of Ballard and
Brown [1].

A Straight Homogeneous Generalized Cylinder is specified by a
four-tuple $(A, C, r, \alpha)$. A is the axis, which is a curve in space
defined in parametric form by $A(s) = (x_A, y_A, z_A)(s)$. The remainder
of this discussion will describe features of the shape relative to the
axis itself rather than in absolute x-y-z coordinates.

At each point A(s) on the axis, let the cross-section be described
on a u-v plane, with A(s) at the origin, and defined by the (constant)
angle $\alpha$. The u-axis will be the direction of steepest descent of the
u-v plane from the tangent to the axis (i.e. the projection of the
tangent to the axis onto the u-v plane). $\alpha$, the angle of inclination, is
the angle from the u-axis to the tangent to the axis at A(s); $\alpha = 0$
means that the u-axis is pointing towards A(1), and $\alpha = \pi$ means
that the u-axis is pointing towards A(0).

In a Straight Homogeneous Generalized Cylinder, the "Straight"
attribute implies that A(s) is a linear function. The axis is thus a line
segment, and all u-v planes are parallel. This property is important
because all tangents to A(s) are parallel, as are all u-axes and all
v-axes. We can therefore define vectors S, U, and V pointing in
these directions and assign a local (object-centered) coordinate
system using u-v-s coordinates. Such coordinates can of course be
defined for all Generalized Cylinders; however there will be no
curvature in the coordinate axes for SGCs. The local coordinates
of an SGC are a linear transformation of world (x-y-z) coordinates.

On  the  u-v  cross-section  plane  for  each  value  of  s,  the  cross- 
section  is  the  set  of  points  r(s)C(t)  for  values  of  t  from  0  to  1, 
inclusive. The cross-section function $C(t) = (u_C, v_C)(t)$ describes
the shape of the cross-section; the radius function r(s) describes its
size. So, the cross-sections have the same shape but may vary in
size;  this  property  is  implied  by  the  "Homogeneous"  attribute.  The 
union  of  the  cross-sections  is  the  surface  of  the  Straight 
Homogeneous  Generalized  Cylinder. 

According  to  this  strict  definition,  some  very  peculiar  shapes  are 
allowed  as  SHGCs,  including  those  with  singular  points  or  arcs  and 
those with cross-sections that are open arcs, points, or even space-filling
curves. Since our ultimate goal is the analysis of the shapes
of common objects, we will generally exclude such bizarre cases
from further consideration. However, since shapes with degenerate
cross-sections (open arcs or points) do have several important
properties, we will note them when appropriate.

We  impose  the  restriction  that  the  functions  A  and  r  be 
continuous  and  differentiable  everywhere  and  that  the  cross- 
section  C  be  continuous  and  differentiable  almost  everywhere.  It  is 
usual,  but  not  required,  that  the  u-v  origin  be  in  the  interior  of  the 
cross-section. In addition, we will presume "uniform scaling" of s
and t, i.e. $\|dA/ds\|$ and $\|dC/dt\|$ are constants.

2.1  Subclasses  of  SHGC 

We  will  also  be  referring  to  several  subclasses  of  SHGC  with 
particular  interesting  properties: 


Linear  SHGC  (LSHGC)  -  SHGC  with  r  linear  (figure  2) 

The  size  of  the  cross-section  varies  linearly  with  distance 
along the axis. LSHGCs are ruled surfaces as well as
being  Generalized  Cylinders  [4]. 

Right SHGC (RSHGC) - SHGC with $\alpha = \pi/2$

The u-v planes are normal to the axis. There is no
"direction of steepest descent" relative to the axis, so the
u-axis may be chosen in any direction on the cross-section
planes.


Circular SHGC (CSHGC) - SHGC with C a circle centered at the
origin

Without loss of generality, let C be a unit circle,
$C(t) = (u_C, v_C)(t) = (\cos 2\pi t, \sin 2\pi t)$. All surfaces of solids of
revolution are Right Circular SHGCs (but with open ends
unless r(0) = 0 or r(1) = 0).

Polygonal SHGC (PSHGC) - SHGC with C polygonal (piecewise
linear)

If $C(t_0)$ is a vertex for some $t_0$, then the set of points $P(s, t_0)$
is a crease (ridge if C convex there, valley if concave).
Otherwise, P(s,t) is on a face; note that faces are not
necessarily planar in this definition.

In various situations, the consequences of these properties will
be shown to be of special interest.

3. Coordinates for SHGCs

For any SHGC, there is a natural u-v-s object-centered
coordinate system imposed by the preceding definition.


Figure 3: Coordinate Axes for SHGCs


We will adopt the convention that the v-axis is chosen to provide a
right-handed u-v-s coordinate system. The unit vectors in the axis
directions will be denoted U, V and S, as shown in figure 3. It will
be convenient to define an orthogonal w-v-s coordinate system
using W perpendicular to V and S. For any point $(u, v, s)_{uvs}$ (where
uvs denotes coordinates in the u-v-s system), the corresponding
coordinates in w-v-s are $(u \sin\alpha,\; v,\; s + u \cos\alpha)_{wvs}$. In a Right
SHGC, since U = W, u-v-s and w-v-s coordinates are identical.

We will use the notation wvs (etc.) whenever the coordinates are
given in a system other than the world (x-y-z) or image (x-y)
systems.

3.1 Points on the Surface and Surface Normals

For any values s and t, the point P(s,t) on the surface of the
SHGC has w-v-s coordinates:

$$P(s,t) = \left( u_C(t)\, r(s) \sin\alpha,\; v_C(t)\, r(s),\; s + u_C(t)\, r(s) \cos\alpha \right)_{wvs} \tag{3-1}$$

Surface  normals  for  an  SHGC  can  be  defined  wherever  the 
cross-section function C(t) is differentiable. The outward pointing
surface normal vector $\bar{N}(s,t)$ at P(s,t) is the cross product of the
tangent vectors to the surface in the directions of increasing s and
t:

$$\bar{N}(s,t) = \frac{\partial P}{\partial t} \times \frac{\partial P}{\partial s}$$

We will use N(s,t) parallel to $\bar{N}(s,t)$, defined by

$$N(s,t) = \frac{\bar{N}(s,t)}{r(s)} = \left( h(t) \cos\alpha\, \frac{dr}{ds} + \frac{dv_C}{dt},\; -\sin\alpha\, \frac{du_C}{dt},\; -h(t) \sin\alpha\, \frac{dr}{ds} \right)_{wvs}$$

where h(t), defined by

$$h(t) = u_C(t)\, \frac{dv_C}{dt} - v_C(t)\, \frac{du_C}{dt} = \begin{vmatrix} u_C(t) & v_C(t) \\ du_C/dt & dv_C/dt \end{vmatrix}$$

is the Wronskian of the cross-section functions $u_C$ and $v_C$ [7]; h(t) =
0 implies that the SHGC has a line segment for a cross-section, i.e.
is degenerate.
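
The point and normal formulae can be evaluated numerically as in the following Python sketch. The function names and the representation of C, r, and their derivatives as callables are assumptions made for illustration. The normal follows the component formula for N(s,t) given above, in w-v-s coordinates.

```python
import math

def unit_circle(t):
    """Cross-section C(t) of a Circular SHGC: (u_C, v_C)."""
    return (math.cos(2 * math.pi * t), math.sin(2 * math.pi * t))

def unit_circle_derivative(t):
    """Derivative (du_C/dt, dv_C/dt) of the unit circle cross-section."""
    return (-2 * math.pi * math.sin(2 * math.pi * t),
            2 * math.pi * math.cos(2 * math.pi * t))

def shgc_point(s, t, C, r, alpha):
    """Surface point P(s,t) of equation (3-1), in w-v-s coordinates."""
    u, v = C(t)
    return (u * r(s) * math.sin(alpha),
            v * r(s),
            s + u * r(s) * math.cos(alpha))

def shgc_normal(s, t, C, dC, dr, alpha):
    """Scaled outward normal N(s,t), in w-v-s coordinates."""
    u, v = C(t)
    du, dv = dC(t)
    h = u * dv - v * du  # Wronskian of u_C and v_C
    return (h * math.cos(alpha) * dr(s) + dv,
            -math.sin(alpha) * du,
            -h * math.sin(alpha) * dr(s))
```

For a right circular cylinder (unit circle, constant r, inclination of 90 degrees) the normal at t = 0 points along the positive w-axis, as expected for an outward normal. With a linear radius function, `dr` is constant and the returned normal does not depend on s.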

The surface normals of an SHGC obey two very important
properties, stated in the Corresponding Normal Theorem:

A non-degenerate SHGC is Linear iff, for all s, $\partial N/\partial s = (0,0,0)$;
a non-degenerate SHGC is Polygonal iff, for
almost all t, $\partial N/\partial t = (0,0,0)$.

This says that the surface normals for an LSHGC depend only on t
(i.e. are parallel along contours of constant t), and for a face of a
PSHGC depend only on s (i.e. are parallel along contours of
constant s within a face). A proof of the Corresponding Normal
Theorem is presented in [8]. The Corresponding Normal Theorem
is especially useful in shadow geometry, since the "shadow
volume" (the volume of space shaded by an object) is an LSHGC
(figure 4).

4.  The  Equivalence  Problem  for  Shape 
Descriptions 

A subtle problem arises from our definition of SHGCs: as we
have defined SHGCs, an SHGC is actually a description of a shape
rather than being a specific solid shape itself. Of course, each such
description describes a unique shape; however, we must attempt to
decide when a single shape may have several different
descriptions. Since a solid shape corresponds to an equivalence
class of descriptions (i.e. SHGCs), we will call two descriptions
equivalent when they describe the same solid shape, as did Marr
and Nishihara in [6]; hence, we refer to this problem as the
Equivalence Problem. We will actually use the term "equivalent" in
this paper even when the ends are slanted differently, as long as the
shapes are otherwise the same.


There are four trivial changes possible in the s and t coordinates
themselves while preserving equivalence:

• the axis can be flipped end-over-end to yield a new
SHGC (reversing the sense of the s coordinate)

• the sense of t can be likewise reversed and, if the
cross-section C(t) is closed, the point at which t = 0
can be shifted to anywhere on the curve

• the radius function r(s) can be multiplied by any
constant scale factor, while the cross-section C(t) is
divided by the same factor

• an RSHGC can have the u-v axes rotated about the
origin arbitrarily (shifting the t coordinate).

These  transformations  are  sufficiently  simple  that  no  deeper 
discussion  is  needed. 

There  are,  however,  more  significant  variations  in  the  possible 
descriptions  of  a  specific  shape  as  an  SHGC.  We  will  investigate 
two  of  the  principal  types  of  variation:  altering  the  orientation  of  the 
cross-section planes and altering the direction of the axis.

4.1  The  Equivalent  Right  SHGC  Problem 

What  properties  of  a  shape  make  it  possible  to  describe  it  as  two 
different  SHGCs,  with  cross-section  planes  at  different 
orientations? Since this question is so general, we will limit our
attention to a more restricted (but still difficult) question: For what
SHGCs are there equivalent Right SHGCs? This is interesting since
the RSHGC seems to be a natural "canonical" form of
representation for a shape. We will ignore the effect of "beveled"
ends resulting from values of $\alpha$ not equal to $\pi/2$.

To make this problem somewhat more tractable, we will presume
that the same axis A and radius function r are to be used for the
SHGC and RSHGC. (We conjecture, but have not proven, that this
presumption implies no loss of generality.) The problem can then
be stated this way: Given an SHGC $G_1 = (A, C_1, r, \alpha)$, with
$C_1 = (u_1, v_1)$, can some function $C_2 = (u_2, v_2)$ be found such that the
RSHGC $G_2 = (A, C_2, r, \pi/2)$ contains the same points as $G_1$? This
is addressed in the Slant Theorem:


A non-degenerate, Oblique SHGC $G_1$ has an
equivalent Right SHGC $G_2$ if and only if $G_1$ is Linear
(figure 5).


So, for each LSHGC, there exists another description of the same
shape which is both an LSHGC and RSHGC, containing all the
same points (but without beveled ends). In this sense, the set of
LSHGCs is a subset of the set of RSHGCs. Also, it doesn't matter
what direction the cross-section planes are taken relative to the
axis of an LSHGC: for any direction some cross-section function C
can be found to describe the shape as an SHGC (ignoring the
possible beveling of the ends).


On the other hand, if an SHGC is not Right or Linear, then there
is no Right SHGC which contains the same points.

4.2  The  Alternate  Axis  Problem 

Having addressed the issue of changing the cross-section
planes, we can ask about moving the axis: For what SHGCs are
there equivalent representations with different axes, using the same
cross-section planes? (This is known to involve a loss of generality
with respect to the question: For what SHGCs are there equivalent
representations with different axes? For example, a sphere satisfies
the latter condition, but not the condition we are addressing here.
We conjecture that only shapes resembling certain regular
polyhedra, of which the sphere is the limiting case, are excluded
from our analysis herein by the restriction to use the same cross-section
planes.) We will begin by restricting the problem so that the
two axes intersect somewhere and so that both axes intersect the
cross-section planes (i.e. the axes of the SHGCs are not parallel to
the cross-section planes).


Figure 6: The Pivot Theorem

This situation is addressed by the Pivot Theorem:

A non-degenerate SHGC can be described as another
SHGC with a different, intersecting axis, and the same
cross-section planes, if and only if it is Linear. If it is
Linear, then it can be so described using any axis which
passes through the apex of the shape and does not lie in
the image of the shape projected through the apex
(figure 6).

As in the Slant Theorem, the different representations of the shape
may have different beveling of the ends. The important
consequences of the Pivot Theorem are that a Linear SHGC can
have  (almost)  any  axis  passing  through  its  apex  and  that  any  non- 
Linear  SHGC  can  have  only  one  possible  axis  under  the  conditions 
stated  above.  A  proof  of  the  Pivot  Theorem  is  presented  in  [8]. 

5. Contours of Tangency for Right SHGCs

Suppose we have an SHGC and we project it along the direction
of a vector $V_E = (\Delta w_E, \Delta v_E, \Delta s_E)_{wvs}$ (a line of sight), as shown in
figure 7. The arcs along which the surface is tangent to the line of
sight as seen from direction $V_E$ (i.e. occlusion, or parallelism to $V_E$)
will be projected by the ends of the SHGC, or where $N \perp V_E$, i.e.
$N \cdot V_E = 0$. The points on the SHGC projected onto such contours are
called contour generators [5]. (Of course, if the SHGC is opaque,
some of the contours may be hidden from view.) Contours are
important because they are the usual loci of discontinuities of
brightness, texture gradients, etc. in an image of a curved surface.


Figure  7:  Contours  and  Contour  Generators 


Figure 8: Viewing Direction and Angle

The discussion of imaging in this paper will be primarily limited to
orthographic projection, in which all lines of sight are parallel.
Also, unless otherwise stated, we will presume in the following
discussion of projection and imaging that we are dealing only with
Right SHGCs, i.e. SHGCs with cross-sections perpendicular to the
axis. As shown in figure 8, this allows the simplification of rotating
the w-v axes as desired; in particular, we will presume that $V_E$ is in
the W-S plane, i.e. $\Delta v_E = 0$. Without loss of generality, we can then
presume that $V_E$ is between -W and S; if the angle from $V_E$ to S
(the viewing angle) is $\sigma$, then $V_E = (-\sin\sigma, 0, \cos\sigma)$. Additional
simplification arises for an RSHGC since $\sin\alpha = 1$ and $\cos\alpha = 0$.
For an RSHGC, using $N \cdot V_E = 0$, the contour generator points must
satisfy

$$0 = \sin\sigma\, \frac{dv_C}{dt} + h(t) \cos\sigma\, \frac{dr}{ds} \tag{5-1}$$

In an end view or side view ($\sigma = 0$ or $\sigma = 90^\circ$), the contour
generators are always planar, but in an oblique view, the contour
generators are generally not planar unless the shape is a Linear
Right SHGC (see the discussion in [8]).

5.1  Images  of  Right  SHGCs 

The assignment of world (scene) coordinates is shown in figure
8. It involves aligning V vertically (Y = V), and placing the line of
sight $V_E$ on Z ($V_E = Z = (0,0,1)$). (Recall here that un-subscripted
coordinates are given in the x-y-z system.) Then S and W are in the
horizontal (X-Z) plane. Note here that the origin is at (0,0,0)
rather than at the eye; under orthography, this changes no
important geometric relationships.

Figure 9: Object and World Coordinate Systems

The image of a point P(s,t) on an RSHGC under orthography is
thus

$$I(s,t) = (x,y)(s,t) = \left( u_C(t)\, r(s) \cos\sigma + s \sin\sigma,\; v_C(t)\, r(s) \right)$$

We are presuming here that there is no scaling difference
between w-v-s and x-y-z coordinates.

5.2  Contour  Generators  Under  Perspective  Projection 

We can also analyze images of RSHGCs under perspective
projection. The contour generator analysis itself can be
accomplished by considering the eye (center of the lens) to be
located at a point $P_E = (w_E, v_E, s_E)$ in the object-centered
coordinate system, as in figure 10. Then, at each point P(s,t) on the
surface of the object, the line of sight is the vector $V_E(s,t)$ from the
eye to P(s,t), defined by:

$$V_E(s,t) = P(s,t) - P_E = \left( u_C(t)\, r(s) - w_E,\; v_C(t)\, r(s) - v_E,\; s - s_E \right)$$

Along a contour generator, we still have $N \perp V_E$, so
$0 = N \cdot V_E = N \cdot (P(s,t) - P_E)$.


Figure 10: Imaging Under Perspective Projection


6. Contour Formation by Right Circular SHGCs

For solids of revolution, which are Right Circular SHGCs, there is
considerable simplification in the orthographic projection and
imaging relationships. Recall that, for a Circular SHGC, uc(t) = cos
2πt and vc(t) = sin 2πt. Using equation (5-1), the contour
generators for an RCSHGC must satisfy

$$0 = \sin\sigma\cos 2\pi t + \cos\sigma\,\frac{dr}{ds}$$

and therefore, as shown in figure 11,

$$t = \frac{1}{2\pi}\cos^{-1}\!\left(-\cot\sigma\,\frac{dr}{ds}\right) \qquad (6\text{-}1)$$

Figure 11: Contour Generator on a Right Circular SHGC

This equation is of fundamental importance, since it expresses t as
a function of s along the contour generator. Thus, given a radius
function r(s) of a solid of revolution and a viewing angle σ, the
above equation tells exactly how the contour generator moves
towards and away from the viewer.
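
As a minimal sketch of equation (6-1), the following Python function returns the cross-section parameter t of the upper contour generator for a given slope dr/ds and viewing angle σ (in radians). The function name and the choice to return None where no contour generator point exists are assumptions made for illustration; they do not come from the paper.

```python
import math

def contour_generator_t(dr_ds, sigma):
    """Parameter t on the upper contour generator, per equation (6-1).

    dr_ds is the slope of the radius function at the axis position of
    interest and sigma is the viewing angle in radians. Returns None
    where |dr/ds| > |tan sigma|, i.e. where no contour generator exists.
    """
    arg = -dr_ds / math.tan(sigma)  # -cot(sigma) * dr/ds
    if abs(arg) > 1.0:
        return None
    return math.acos(arg) / (2.0 * math.pi)
```

Where dr/ds = 0 the function returns t = 1/4 for any oblique angle, which corresponds to the "top" of the object.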

Now, since t is a function of s along the contour generator, the
points P(s,t) along the contour generator can be specified as
PCG(s), a function of s only:

$$P_{CG}(s) = \left(-\cot\sigma\,r(s)\,\frac{dr}{ds},\; \pm r(s)\sqrt{1 - \cot^2\sigma\,(dr/ds)^2},\; s\right)$$

The contour generator includes points such that the v-coordinate
of PCG(s) is defined, i.e. |dr/ds| ≤ |tan σ|. The contour generator
is not generally planar in an oblique view.

On an RCSHGC, the image of a point PCG(s) on the contour
generator is

$$(x_{CG}, y_{CG})(s) = \left(-r(s)\,\frac{\cos^2\sigma}{\sin\sigma}\,\frac{dr}{ds} + s\sin\sigma,\; r(s)\sqrt{1 - \cot^2\sigma\,(dr/ds)^2}\right) \qquad (6\text{-}2)$$

Further, the slope of the image contour, dyCG/dxCG, can be
determined as a function of dr/ds using the above equation:

$$\frac{dy_{CG}}{dx_{CG}} = \frac{1}{\sqrt{\sin^2\sigma - \cos^2\sigma\,(dr/ds)^2}}\,\frac{dr}{ds} \qquad (6\text{-}3)$$
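
The image relationships (6-2) and (6-3) can be sketched in a few lines of Python. The function names and the None return for points with no contour generator are illustrative assumptions, and the formulas are the ones read from the equations above.

```python
import math

def image_contour_point(s, r, dr_ds, sigma):
    """Image (x, y) of the upper contour-generator point at axis position s, per (6-2)."""
    cot = math.cos(sigma) / math.sin(sigma)
    under = 1.0 - (cot * dr_ds) ** 2
    if under < 0.0:
        return None  # |dr/ds| > |tan sigma|: no contour generator point here
    x = -r * dr_ds * math.cos(sigma) ** 2 / math.sin(sigma) + s * math.sin(sigma)
    y = r * math.sqrt(under)
    return x, y

def image_contour_slope(dr_ds, sigma):
    """Slope dy/dx of the image contour as a function of dr/ds, per (6-3)."""
    under = math.sin(sigma) ** 2 - (math.cos(sigma) * dr_ds) ** 2
    if under <= 0.0:
        return None  # slope undefined at or beyond |dr/ds| = |tan sigma|
    return dr_ds / math.sqrt(under)
```

For a cone with constant dr/ds, the slope given by the second function agrees with a finite-difference slope taken between two points returned by the first.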

6.1 Occlusions and Singularities in Image Contours

Where |dr/ds| > |tan σ|, there will be no points on the contour
generator. This causes occlusion of the contour generator from
view, with resulting discontinuity in the visible contours.

Figure 12 shows an object seen from a side view (σ = π/2). In
this view, tan σ is infinite and |dr/ds| < tan σ for all s. There will be
a single continuous contour generator on the object, which will in
fact be planar (running along the top and bottom of the object).


Figure 12: Contour Generator in Side View (showing contour generator)


Figure 13 shows what happens as we rotate the object slightly,
decreasing σ and hence tan σ. As long as |dr/ds| < tan σ
everywhere, the contour generator will still be continuous.
However, it will no longer be planar (unless the object is also a
Linear SHGC, which we will not consider further here). From
equation (6-1), we see that where dr/ds is 0, t = 1/4, i.e. the
contour generator is on "top" of the object. (Of course, there is
also an identical contour generator on the bottom.) Where dr/ds <
0, t < 1/4 and the contour generator is pushed away from us; where
dr/ds > 0, the contour generator is pulled towards us. Note also
that |dr/ds| > |tan σ| at the ends of the object, hence the contour
generator no longer includes the ends as the object is slightly
rotated away from a side view.

Let us presume for the moment that the object is thinner at the
near end, i.e. dr/ds < 0 towards that end. Eventually, as shown in
figure 14, we rotate the object so much that dr/ds = -tan σ at
some value of s, say s_m, where dr/ds is at a minimum. At this value,
since dr/ds = -tan σ and d²r/ds² = 0 at s_m (because s_m is a
relative minimum for dr/ds), we have dxCG/ds = 0 and dyCG/ds = 0
at s_m. Thus, the contour generator is tangent to the line of sight.


Figure 13: Contour Generator in Near-Side View (top view showing contour generator)


Figure 14: Contour Generator Tangent to Line of Sight (top view showing contour generator)

Figure 15 shows what happens when we rotate the object yet
farther. There is an interval (s_a, s_b) around s_m in which dr/ds <
-tan σ, i.e. for which no contour generator points exist. What has
happened is that the former, single contour generator has been
split into two separate contour generators, corresponding to values
of s such that s > s_b and s < s_a. Along the contour for s > s_b, all
points will be visible in the image (i.e. none are occluded by the


As stated earlier, the above analysis presumes that σ is known
and that the image is appropriately aligned. Figure 19 shows that, if
σ is not known in advance, the image of the contours of a solid of
revolution is ambiguous, with one degree of freedom (σ). Part (a) of
figure 19 shows a solid of revolution from a side view; (b) shows the
image contours produced when the shape is viewed from a viewing
angle of 45°. Part (c) shows the set of different solids of revolution
which might have produced the figure if viewed from various
angles. (Thick lines in (c) correspond to visible contours; thin lines
are portions of r(s) estimated by the method [...].) For each
possible value of σ within some interval of possibilities, a plausible
function r(s) can be determined whose image contours match those
actually seen.

Figure 18: Contour Analysis Results in Shape Description
(panels: image x, y and dy/dx; description r(s) and dr/ds)


(a) Solid of revolution, side view.

(b) Image of solid seen at 45 degrees.

7.2  Occluded  Contours  and  Silhouettes 

We have seen how to analyze contours to determine values of
r(s); now, we will discuss how much of r(s) can be reconstructed in
this manner (i.e. over what range of values of s). As we already
know, there is a contour generator only where |dr/ds| ≤ |tan σ|;
values of s for which |dr/ds| > |tan σ| therefore do not correspond
to any points on a contour generator, and r(s) cannot be
determined for these values by examination of image contours. In
addition, as described in section 6.1, the object itself may occlude
portions of the contour generator from view.

For analyzing a silhouette, exactly the same methods and
conditions apply, except that, using the notation of section 6.1, the
contour for s < s_c will render invisible the contour for s > s_b which
lies to the left of I_CG(s_c); thus, in figure 16, only segments A and B
will be visible. Silhouettes are simply images of contours in which
certain portions of some contours are not visible.

If dr/ds > 0, the situation is just the above viewed from the
opposite direction (i.e. segments A, B, and X in figure 16 will be
visible and segment Y will be occluded). In this case, when the
contour generator splits, the arc for s < s_a is still occluded, but the
arc for s > s_b is a closed curve in the image rather than flaring out
as above. Silhouette analysis will be identical; indeed, the
silhouette of an object is identical (to within a reflection) viewed
from opposite directions.


(c)  Possible  interpretations  of  image  at  various  viewing  angles. 
Thick lines are visible portions; thin lines are estimated.
Interpretations  are  scaled  to  same  horizontal  size. 


Figure  19:  Contours  of  a  Solid  of  Revolution  are  Ambiguous 

In any of these cases, there will be an interval of s over which r(s)
cannot be computed, say (s_1, s_2). However, we can compute r(s_1),
r(s_2), dr/ds at s_1, and dr/ds at s_2. For practical image analysis, it is
possible to estimate r(s) over (s_1, s_2) by fitting a function which
conforms to these constraints. For example, a cubic polynomial
can be fit to the data. If we let r(s) = as³ + bs² + cs + d, we then
have dr/ds = 3as² + 2bs + c. Then the following system of linear
equations can easily be solved to determine the values of a, b, c,
and d:

$$\begin{aligned}
a s_1^3 + b s_1^2 + c s_1 + d &= r(s_1) \\
a s_2^3 + b s_2^2 + c s_2 + d &= r(s_2) \\
3a s_1^2 + 2b s_1 + c &= \left.\frac{dr}{ds}\right|_{s_1} \\
3a s_2^2 + 2b s_2 + c &= \left.\frac{dr}{ds}\right|_{s_2}
\end{aligned}$$
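
The cubic fit described here amounts to four linear equations in a, b, c, and d: two from the known values of r at the ends of the interval and two from the known slopes there. The sketch below is one way to set this up and solve it in plain Python; the function names, the endpoint names s1 and s2, and the use of Gaussian elimination are assumptions for illustration rather than the paper's own procedure.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_cubic(s1, r1, dr1, s2, r2, dr2):
    """Coefficients (a, b, c, d) of r(s) = a s^3 + b s^2 + c s + d
    matching the values r1, r2 and slopes dr1, dr2 at s1 and s2."""
    A = [
        [s1 ** 3, s1 ** 2, s1, 1.0],       # r(s1) = r1
        [s2 ** 3, s2 ** 2, s2, 1.0],       # r(s2) = r2
        [3 * s1 ** 2, 2 * s1, 1.0, 0.0],   # dr/ds at s1 = dr1
        [3 * s2 ** 2, 2 * s2, 1.0, 0.0],   # dr/ds at s2 = dr2
    ]
    return solve_linear(A, [r1, r2, dr1, dr2])
```

The system has a unique solution whenever s1 and s2 differ, so the estimate of r(s) over the hidden interval is fully determined by the four measured quantities.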
7.3 Aligning the Image

The above analysis has presumed that we know the viewing
angle σ and have aligned the images of the axis endpoints A(0) and
A(1) onto (0,0) and (sin σ, 0), respectively. We will now address the
problems of aligning the image and determining σ.

Suppose we are given an image of the contours of a Right
Circular SHGC, arbitrarily scaled, rotated, and translated, and
viewed from an unknown angle. We can immediately determine the
image of the axis, since this will be an axis of symmetry in the
image, and rotate the image so this is horizontal. By translating the
image, this axis can be made to line up with the x-axis. Brooks [3]
and Marr and Nishihara [6] discuss the issue of finding the
image of the axis of a generalized cylinder.

We must next determine which end of the object is nearer, so this
can be placed on the right as in our imaging model. If the left end is
closer, we will need to reflect the image about the y axis, or
equivalently rotate it 180° about the origin.


We must also determine σ and find the images of the axis
endpoints. Determining σ is more important, since it affects both
the image alignment and the shape of the reconstructed function
r(s). The axis endpoints, on the other hand, affect only the scale
and shifting (along s) of r(s).

Figure 20 shows an object whose closer end is flat, i.e. r(1) > 0.
The edge of the cross-section at that end will produce a contour in
the image, which will be a vertically elongated ellipse. This is a very
useful configuration, since we can easily determine which end of
the object is closer. Then, we know that the center of the ellipse
must be the image of the axis endpoint A(1); further, we can
compute the viewing angle σ from the eccentricity of the ellipse,
using cos σ = b/a, where a and b are the semimajor and semiminor
axis lengths, respectively.


Figure 20: RCSHGC With Near End Flat


point, we saw that a natural set of object-centered coordinates can
be used to specify the positions of points on the surface of an
SHGC and the direction of the surface normal at each point.

We looked at the Equivalence Problem, and saw that only Linear
SHGCs (i.e. those with radius proportional to distance from some
apex point on the axis) can be described in more than one way with
different cross-sections or (under some restrictions) with different
axes. This means that an hourglass shape, for example, can
only be described as an SHGC in one way.

Next, we looked at the analysis of tangency contours of SHGCs.
We began by presenting the condition which must be satisfied by
the points on the contour generator of any SHGC. Right SHGCs
allow an important simplification of this condition. The contour
generator is planar in an end view or side view of any Right SHGC,
and in any oblique view of a Right Linear SHGC. In an oblique view
of a non-Linear Right SHGC, it is difficult to determine whether the
contour generator is planar.

Solids of revolution (Right Circular SHGCs) are amenable to
much more detailed analysis. It is possible to describe the visible
contours in an image of a solid of revolution as they depend on the
radius function and the viewing angle; these relationships can be
inverted to determine the radius function from the image contours,
when the viewing angle is known in advance. If the viewing angle is
not known, the image of a solid of revolution is ambiguous with one
degree of freedom. The viewing angle can be determined if the
object has a visible flat end. Silhouettes can be analyzed by the
same methods used for tangency contours.

It was also observed that a "special viewpoint" exists when the
contour generator is tangent to the line of sight. For a solid of
revolution, this occurs when the viewing angle is equal to a local
maximum in the angle between the radius function and the s axis.
An Algorithm to Display Generalised Cylinders


Richard Scott

Department of Computer Science


Abstract


This paper describes an algorithm, capable of
parallel implementation, to calculate the perspective
image of a Generalised Cylinder, from arbitrary viewpoint,
with hidden surface removal. It applies to a wide
class of cylinders. The time taken will be proportional
to the total length of the contours, independent of the
number of edges. The algorithm solves for one closed-loop
contour generator at a time, testing its contour
(in the image plane) for intersection with visible segments
of previous contours. The inputs are the functions
describing the object, along with the position of the eye;
and the outputs are the visible surfaces and edges.


1.  Introduction 


Model based vision systems predict the appearance
of their models by calculating the perspective projection
or "image", from different viewpoints. In the past, this
has only been done for models with analytic, closed form
solution to the perspective projection equations. Since
Generalised Cylinders are a natural, and commonly
used model, a general method for getting their image
will be useful. From a 3-D Computer Graphics viewpoint,
the algorithm presents an alternative to building
models out of surface patches and planar surfaces and
it is equally general. Most previous methods for displaying
parametric surfaces have ignored the structure
of the scene; for example scan line algorithms can be
fooled by certain folds in the surface.

The overall idea of the algorithm is to look for
those lines on the model's surface whose points are tangent
to the line of sight, called "Contour-Generators".
These form closed loops. If one point is known it can be
efficiently propagated to form the whole line, by using
a simple expression for the contour-generator tangent
vector. Surfaces become occluded when they pass behind
contour-generators. So by calculating where the
perspective projections of the contour-generators, called
"contours", cross each other, the visible outlines of the
model are found. The visible surfaces are produced by
expanding the regions attached to the front side of each
contour generator until their image crosses a contour.
The method is a generalisation of one used in [3], for
volumes bounded by planar faces.

The image of a scene consisting of several larger
objects, made up of intersecting models, is obtained one
model at a time. Negative volumes or holes, in objects
can also be displayed. The models themselves can have
smoothly changing surfaces and edges.

The rest of the paper will be split into:

2. Generalised Cylinder definition.

3. Outline of the algorithm.

4. Implementation of the definition in 2., and discussion
of those Generalised Cylinders accepted by the
algorithm (almost all of them).

5. Results, equations, formulas, referred to in the
algorithm outline 3., and Generalised Cylinder derivative
functions.

6. Comparison to other Algorithms.

7. Current state of the implementation.

8. Extensions and summary.


2. Generalised Cylinder Definition

Generalised Cylinders were first introduced by
Binford [1]. When a planar shape sweeps through space,
the volume it passes through is a Generalised Cylinder
(GC). Recently Shafer [2] has made a precise definition,
which I give here.

(i) There is a space curve, called the axis of the
shape.

(ii) At each point in the axis, at some fixed angle
to the axis, there is a cross section plane defined.

(iii) On each such cross-section plane, there is a
planar curve which constitutes the cross-section of the
object on that plane.

(iv) There is a deformation rule which specifies the
transformation of the cross-section as the cross-section
plane is swept along the axis. This rule always imposes
(at least) the constraint that the cross-section changes
smoothly.

(v) I add the further condition that every point in
the object must lie on exactly one cross section plane.


This is the class of shapes that the algorithm deals
with, along with the restriction that the cross-section
curve is closed.


3. Outline of Algorithm

The algorithm can be divided into two parts.
First (A), solution for the visible parts of the contour-generators,
and secondly (B), region growing to get
visible surfaces. The first part is the principal one.
It has two subparts, which are repeated, and together
find one contour generator. Each contour generator is
a closed loop, intersecting no others, which divides the
surface into forward (visible if unoccluded), and backward
facing (invisible) areas. The square root of the size
of each visible area is a measure of the length scale over
which things are happening in that region of the GC.

The first subpart (A1) steps over the GC with step
length proportional to the square root of the area of
the region it is contained by, until either the whole
surface has been covered, in which case the algorithm
stops; or a step containing a new contour-generator is
found. In this case, the step is then bisected down
to an exact solution. A test to see whether each step
jumps a new contour generator can be made. For
whenever the direction (forward or back) that a surface
point is facing differs from the direction predicted
by the regions of the existing contour generators, then
there must be an undiscovered contour-generator passing
nearby. This means that if one stepping point has
the same predicted and actual surface direction, and
the next does not, then a new contour generator passes
through the intervening step. This interval is reduced,
using bisection, with the condition that one end of the
interval must have the same predicted and real surface
directions, while the other end must not.
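
The bisection in subpart A1 can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes no contour generators have been found yet, so the "predicted" facing direction is simply the facing direction at the first end point, and a new contour generator lies wherever the sign of the scalar product of the surface normal and the line of sight changes. The function g and the name bisect_step are hypothetical.

```python
import math

def bisect_step(g, p0, p1, tol=1e-9):
    """Bisect the parameter-space segment p0-p1 down to a zero of g.

    g(t, s) is assumed to return the scalar product of the surface normal
    and the line of sight at the surface point (t, s); its sign tells which
    way the surface faces. p0 and p1 must face opposite ways.
    """
    g0 = g(*p0)
    if g0 * g(*p1) > 0:
        raise ValueError("end points face the same way; nothing to bisect")
    while math.dist(p0, p1) > tol:
        mid = ((p0[0] + p1[0]) / 2.0, (p0[1] + p1[1]) / 2.0)
        gm = g(*mid)
        if g0 * gm <= 0:
            p1 = mid            # sign change lies between p0 and mid
        else:
            p0, g0 = mid, gm    # sign change lies between mid and p1
    return ((p0[0] + p1[0]) / 2.0, (p0[1] + p1[1]) / 2.0)

def g_cylinder(t, s):
    """Example: unit circular cylinder along the s axis, eye at (5, 0, 0).
    With P = (cos 2*pi*t, sin 2*pi*t, s) the scalar product is 1 - 5 cos 2*pi*t."""
    return 1.0 - 5.0 * math.cos(2.0 * math.pi * t)
```

Calling bisect_step(g_cylinder, (0.1, 0.5), (0.3, 0.5)) narrows the step down to the point on the cylinder where the line of sight grazes the surface.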

The solution is handed over to the second subpart
(A2), which propagates it around the whole contour
generator, back to its start, making a list of the solutions
as it goes, and noting the ones where the contour-generator
becomes visible or occluded. It works by stepping
along the contour-generator tangent (2) (see section
5.), to get a guess for the next solution point,
which is Newton-Raphson iterated to a sufficient accuracy
(see section 5.). If the Newton-Raphson does
not converge, several points around a small circle are
tested to find an interval to bisect down to the next
solution. The step length is taken proportional to curvature
of the contour, to get uniform interpolation accuracy
between the known contour points (3), (4). Each
step is projected to the image, and checked for intersection
with those previously projected steps, which have
not been shown to be hidden. When an intersection is
found, the exact positions of the occluding and occluded
contour generator points are calculated (8). Finally the
whole contour is checked against possible surrounding
contours.
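
A minimal sketch of the propagation step in A2 is given below. It works purely in (t, s) parameter space on a scalar function g whose zero set is the contour generator, steps along the tangent to get a guess, and pulls the guess back onto g = 0 by Newton iteration. Several things here are assumptions for illustration: the function names, the finite-difference gradient, the fixed step length (the paper varies it with contour curvature), and the omission of the small-circle fallback, the image projection, and the intersection tests.

```python
import math

def gradient(g, t, s, h=1e-6):
    """Central-difference gradient of g in (t, s) parameter space."""
    return ((g(t + h, s) - g(t - h, s)) / (2 * h),
            (g(t, s + h) - g(t, s - h)) / (2 * h))

def newton_correct(g, t, s, tol=1e-10, max_iter=20):
    """Pull (t, s) back onto g = 0 by Newton steps along the gradient."""
    for _ in range(max_iter):
        value = g(t, s)
        if abs(value) < tol:
            return t, s
        gt, gs = gradient(g, t, s)
        norm2 = gt * gt + gs * gs
        if norm2 == 0.0:
            break
        t -= value * gt / norm2
        s -= value * gs / norm2
    # None signals non-convergence; the paper falls back to bisection here
    return (t, s) if abs(g(t, s)) < tol else None

def trace(g, start, step, n_steps):
    """Follow the curve g = 0 from start, returning the list of solution points."""
    points = [start]
    t, s = start
    for _ in range(n_steps):
        gt, gs = gradient(g, t, s)
        norm = math.hypot(gt, gs)
        if norm == 0.0:
            break
        # The tangent is perpendicular to the gradient; this choice of sign
        # keeps the g > 0 side on the right of the direction of travel.
        guess = (t - step * gs / norm, s + step * gt / norm)
        corrected = newton_correct(g, *guess)
        if corrected is None:
            break
        t, s = corrected
        points.append((t, s))
    return points
```

Keeping a fixed sign for the tangent means the traced curve always has the same facing region on the same side, which is the property the paper relies on to keep contour generators from being followed backwards.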


To convert to a form implementable in parallel,
step A1 is done independently, at different points on
the GC, and then A2 is used to form contour-generator
segments, which can be simply joined up into the complete
contour lists.

Either way, each list of contour points is now followed
down, keeping count of the marked occluded
points, to produce lists of just the visible ones.


4. Implementation of Definitions

The definition, from section 2, can be implemented
as a bivariate surface, with s parametrising the axis and
t parametrising the cross section curve, both roughly
proportional to arc length, 0 ≤ s,t ≤ 1. If N denotes
the outward normal, (t, s, N) form a right handed set.
The quantities that have to be defined are:

E   the position vector of the eye.

SP(s)   the axis (or spine) (can use intrinsic
coordinates), 0 ≤ s ≤ 1.

α   the angle between the y-axis of the cross-section
plane and the spine.

C(t,s)   the closed cross section curve. Given
s, t parametrises the curve so that C(0,s) = C(1,s),
with t roughly proportional to arc length.

Functions for the deformation rule:

tx(s)   a twisting angle, giving rotation of
the cross-section in its own plane.

Derivable from these are:

P(t,s)   a point on the GC surface.

L(t,s) = P - E   the line of sight vector.

N(t,s)   the normal to the surface.

N.L = 0 is the equation to be solved. (See the
results and formulas section) where . means scalar
product.

The algorithm will work for arbitrary cross section
C(t,s). But any C(t,s) can be decomposed (to desired
accuracy by taking enough terms) into separable form,
by

$$C(t,s) = r_1(s)\,C_1(t) + r_2(s)\,C_2(t) + \ldots$$

where the r_i are called "radius" functions.

To see this take, for example, C_i(t) = C(t, s_i) for
0 < i < n and take r_i(s) to interpolate smoothly between
the cross-section C_i functions. Then as n → ∞,
the expression on the right → C(t,s). This decomposition
also has useful properties when a small number of terms
are taken, e.g. with two terms, a cross section shape
can be stretched in two independent directions; or ruled
surfaces can be formed; or a square can be smoothly
deformed into a circle (this is the shape of many coffee
jars).

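For example the following functions merge a square
into a circle, along a parabolic axis.

$$C_1 = (\cos 2\pi t,\; \sin 2\pi t)$$

$$C_2 = \begin{cases}
(1 - 4t,\; 4t) & 0 \le t \le \tfrac{1}{4} \\
(1 - 4t,\; 2 - 4t) & \tfrac{1}{4} \le t \le \tfrac{1}{2} \\
(-3 + 4t,\; 2 - 4t) & \tfrac{1}{2} \le t \le \tfrac{3}{4} \\
(-3 + 4t,\; -4 + 4t) & \tfrac{3}{4} \le t \le 1
\end{cases}$$

$$r_1 = s, \qquad r_2 = 1 - s$$

$$SP = (20s,\; 10s^2,\; 0)$$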
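
The square-to-circle example can be written out as a short Python sketch. It assumes the example functions read as follows: C1 is the unit circle, C2 is a square traced through (1,0), (0,1), (-1,0), (0,-1) in four linear segments, r1 = s, r2 = 1 - s, and the spine is SP = (20s, 10s², 0). The function names are illustrative, and the rotation and twist that place the cross-section on the spine are left out.

```python
import math

def circle(t):
    """C1(t): the unit circle, with t running from 0 to 1."""
    return (math.cos(2 * math.pi * t), math.sin(2 * math.pi * t))

def square(t):
    """C2(t): a square through (1,0), (0,1), (-1,0), (0,-1), closed at t = 0 and 1."""
    if t <= 0.25:
        return (1 - 4 * t, 4 * t)
    if t <= 0.5:
        return (1 - 4 * t, 2 - 4 * t)
    if t <= 0.75:
        return (-3 + 4 * t, 2 - 4 * t)
    return (-3 + 4 * t, -4 + 4 * t)

def cross_section(t, s):
    """C(t, s) = r1(s) C1(t) + r2(s) C2(t) with r1 = s and r2 = 1 - s."""
    (cx, cy), (qx, qy) = circle(t), square(t)
    return (s * cx + (1 - s) * qx, s * cy + (1 - s) * qy)

def spine(s):
    """SP(s): the parabolic axis."""
    return (20 * s, 10 * s * s, 0.0)
```

At s = 0 the cross-section is the square and at s = 1 it is the circle; intermediate values of s give rounded squares.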

Generalised Cylinders can be built up into more
complicated models. They can describe the shape of
additions and also of holes. "Negative volume", that is,
a hole in a solid object, can be displayed in the same way
as everything else. This is done by distinguishing two
types of contour generator points, depending on how
the line of sight makes a tangent there. The possibilities
are that the tangent lies outside the surface of the
model, "outer" contour-generator points, or that it lies
inside, "inner" contour-generator points. At these outer
points, the adjacent forward facing surface is closer to
the eye than the back face; the opposite is true for
inner points. A contour-generator can often change
from consisting of outer to inner points, especially in a
complex scene. To give a generalised cylinder negative
volume, and display it as a hole, the outer and inner
contours need to be swapped, and the outward pointing
vectors at the cutoff ends are reversed.


5.  Results,  Formulas 

This section contains some of the theory that
makes the algorithm work and also the results and
derivation of equations referred to in the algorithm outline.

1. Formula for P(t,s), a point in the GC surface.

P = SP + [R]*[rotX, 90 - α]*[rotZ, tx]*C(t,s)   (1)

Where SP(s) tracks out the spine as s changes,
0 ≤ s ≤ 1; [R(s)] represents the rotation part of the
transformation matrix into the untwisted coordinate
frame on the spine; and tx is the twisting.

Let A = dSP/ds, and (a b c) = A/|A| where |A| ≠ 0.
Then (a b c) is a unit vector tangent to the spine.


2. Formula for the outward surface normal, N(t,s).

$$N(t,s) = \frac{\partial P}{\partial t} \times \frac{\partial P}{\partial s}$$

where × means vector product.

3. Formula for the tangent to a contour-generator
in parameter space.

For contour-generator s(t), N(t,s) . L(t,s) = 0,
where the dot . represents the scalar product and
L = P - E is the line of sight. Differentiating this
with respect to t gives a vector tangent to the contour-generator.

Given one point, P(t,s), on a contour-generator, V
the unit tangent vector is used to get a good estimate
for another nearby.

P(t + λ*Vt, s + λ*Vs)   (3)

Where λ is the step length.

4(i) The vector V always points the same way.
Imagine looking down at the contour-generator from
outside the cylinder, then the vector has the forward
face on its right, and the back face on its left, when (t,
s, N) form a right handed set.

Proof: either by drawing four diagrams (see
diagrams); or, since U changes continuously, parallel to
the contour-generator, change of direction means that
U = 0, and that there is another contour-generator
at that point. So a contour-generator branch can be
chosen to keep U pointing the same way.

4(ii) Adjacent contour-generators point in opposite
directions. This is because, if the intervening surface
faces forward, say, then it must be on the left hand
side of both contour-generators, making them point
oppositely.

4(iii) Contour-generators cannot cross. Since crossing
means that U = 0 (see diag), take the continuation
of the contour-generator to be either L or R, whichever
makes the smallest angle between steps.


[R] is untwisted means that it represents a rotation
about an axis perpendicular to the z-axis. It rotates the
z axis into (a b c). It works out to be,


5. Step length, λ, along V is taken proportional to
the curvature in the image plane. This means that
interpolation in the image plane, joining up the calculated
ABSTRACT

In this paper we offer a critical evaluation
of the partitioning (perceptual organization)
problem, noting the extent to which it has
distinct formulations and parameterizations. We
show that most partitioning techniques can be
characterized as variations of four distinct
paradigms, and argue that any effective technique
must satisfy two general principles. We give
concrete substance to our general discussion by
introducing new partitioning techniques for planar
geometric curves, and present experimental results
demonstrating their effectiveness.


I  INTRODUCTION 


A  basic  attribute  of  the  human  visual  system 
is  its  ability  to  group  elements  of  a  perceived 
scene  or  visual  field  into  meaningful  or  coherent 
clusters;  in  addition  to  clustering  or 
partitioning, the visual system generally imparts
structure  and  often  a  semantic  interpretation  to 
the  data.  In  spite  of  the  apparent  existence 
proof  provided  by  human  vision,  the  general 
problem  of  scene  partitioning  remains  unsolved  for 
computer  vision.  Furthermore,  there  is  even  some 
question  as  to  whether  this  problem  is  meaningful 
(or  a  solution  verifiable)  in  its  most  general 
form. 

Part  of  the  difficulty  resides  in  the  fact 
that  it  is  not  clear  to  what  extent  semantic 
knowledge  (e.g.,  recognizing  the  appearance  of  a 
straight  line  or  some  letter  of  the  English 
alphabet),  as  opposed  to  generic  criteria  (e.g., 
grouping  scene  elements  on  the  basis  of  geometric 
proximity),  is  employed  in  examples  of  human 
performance.  It  would  not  be  unreasonable  to 
assume  that  a  typical  human  has  on  the  order  of 
tens  of  thousands  of  iconic  primitives  in  his 
visual  vocabulary;  a  normal  adult's  linguistic 
vocabulary  might  consist  of  from  10,000  to  40,000 
root  words,  and  iconic  memory  is  believed  to  be  at 
least  as  effective  as  its  linguistic  counterpart. 
Since,  at  present,  we  cannot  hope  to  duplicate 
human  competence  in  semantic  interpretation,  it 
would be desirable to find a task domain in which
the  influence  of  semantic  knowledge  is  limited. 


In  such  a  domain  it  might  be  possible  to  discover 
the  generic  criteria  employed  by  the  human  visual 
system  and  to  duplicate  human  performance.  One  of 
the  main  goals  of  the  research  effort  described  in 
this  paper  is  to  find  a  set  of  generic  rules  and 
models  that  will  permit  a  machine  to  duplicate 
human  performance  in  partitioning  planar  curves. 


II  THE  PARTITIONING  PROBLEM:  ISSUES 
AND  CONSIDERATIONS 


Even  if  we  are  given  a  problem  domain  in  which 
explicit  semantic  cues  are  missing,  to  what  extent 
is  partitioning  dependent  on  the  purpose, 
vocabulary,  data  representation,  and  past 
experience  of  the  "partitioning  instrument,"  as 
opposed  to  being  a  search  for  context-independent
"intrinsic  structure"  in  the  data?  We  argue  that
rather  than  having  a  unique  formulation,  the
partitioning  problem  must  be  parameterized  along  a
number  of  basic  dimensions.  In  the  remainder  of 
this  section  we  enumerate  some  of  these  dimensions 
and  discuss  their  relevance. 

A.  Intent  (Purpose)  of  the  Partitioning  Task 

In  the  experiment  described  in  Figure  1,  human
subjects  were  presented  with  the  task  of 
partitioning  a  set  of  two-dimensional  curves  with 
respect  to  three  different  objectives:  (1)  choose  a 
set  of  contour  points  that  best  mark  those 
locations  at  which  curve  segments  produced  by 
different  processes  were  "glued"  together; 
(2)  choose  a  set  of  contour  points  that  best  allow 
one  to  reconstruct  the  complete  curve;  (3)  choose  a 
set  of  contour  points  that  would  best  allow  one  to 
distinguish  the  given  curve  from  others.  Each 
person  was  given  only  one  of  the  three  task 
statements.  Even  though  the  point  selections 
within  a  task  varied  from  subject  to  subject,  there 
was  significant  overlap  and  the  variations  were
easily  explained  in  terms  of  recognized  strategies 
invoked  to  satisfy  the  given  constraints;  however, 
the  points  selected  in  the  three  tasks  were 
significantly  different.  Thus,  even  in  the  case  of
data  with  almost  no  semantic  content,  the
partitioning  problem  is  NOT  a  generic  task
independent  of  purpose.


The  research  reported  herein  was  supported  by  the  Defense  Advanced  Research  Projects  Agency  under 
Contract  No.  MDA  903-83-C-0027  and  by  the  National  Science  Foundation  under  Contract  No  ECS-7917028. 
B.  Partitioning  Viewed  as  an  Explanation  of  Curve  Construction

With  respect  to  "process  partitioning" 
(partitioning  the  curve  into  segments  produced  by
different  processes),  a  partition  can  be  viewed  as 
an  explanation  of  how  the  curve  was  constructed. 
Explanations  have  the  following  attributes  which, 
when  assigned  different  "values,"  lead  to  different
explanations  and  thus  different  partitions: 

*  Vocabulary  (primitives  and  relations)  — 
what  properties  of  our  data  should  be 
represented,  and  how  should  these 
properties  be  computed?  That  is,  we  must 
select  those  aspects  of  the  problem  domain 
we  consider  relevant  to  our  partition 
decisions  (e.g.,  geometric  shape,  gray 
scale,  line  width,  semantic  content),  and 
enable  their  computation  by  providing 
models  for  the  corresponding  structures 
(e.g.,  straight-line  segment,  circular  arc, 
wiggly  segment).  We  must  also  allow  for
the  appropriate  "viewing"  conditions;  e.g.,
symmetry,  repeated  structure,  parallel
lines,  are  global  concepts  that  imply  that
the  curve  has  finite  extent  and  can  be
viewed  as  a  "whole,"  as  opposed  to  only
permitting  computations  that  are  based  on
some  limited  interval  or  neighborhood  of
(or  along)  the  curve. 

*  Definition  of  Noise  —  in  a  generic  sense, 
any  data  set  that  does  not  have  a  "simple 
(concise)"  description  is  noise.  Thus,
noise  is  relative  to  both  the  selected 
descriptive  language  and  an  arbitrary  level 
of  complexity.  The  particular  choices  for 
vocabulary  and  the  acceptable  complexity 
level  determine  whether  a  point  is  selected 
as  a  partition  point  or  considered  to  be  a 
noise  element. 

*  Believability  —  depending  on  the 

competence  (completeness)  of  our  vocabulary 
to  describe  any  curve  that  may  be 
encountered,  the  selected  metric  for 
judging  similarity,  and  the  arbitrary 
threshold  we  have  chosen  for  believing  that 
a  vocabulary  term  corresponds  to  some 
segment  of  a  given  curve,  partition  points 
will  appear,  disappear,  or  shift. 

C.  Representation

The  form  in  which  the  data  is  presented  (i.e.,
the  input  representation),  as  well  as  the  type  of
data,  are  critical  aspects  of  the  problem
definition,  and  will  have  a  major  impact  on  the 
decisions  made  by  different  approaches  to  the 
partitioning  task.  Some  of  the  key  variables  are: 

*  Analog  (pictorial)  vs  digital  (quantized)
vs  analytic  description  of  the  curves 

*  Single  vs  multiple  "views"  (e.g.,  single 
vs.  multiple  quantizations  of  a  given 
segment) 

*  Input  resolution  vs.  length  of  smallest 
segment  of  interest 


*  Simply-connected  (continuous)  curves  vs
self-intersecting  curves  or  curves  with
"gaps"

*  For  complex  situations,  is  connectivity
provided,  or  must  it  be  established?

*  If  a  curve  possesses  attributes  (e.g.,  gray
scale,  width)  other  than  "shape"  that  are
to  serve  as  partitioning  criteria,  how  are
they  obtained:  by  measurement  on  an
actual  "image,"  or  as  symbolic  tags
provided  as  part  of  the  given  data  set?

D.  Evaluation

How  do  we  determine  if  a  given  technique  or 
approach  to  the  partitioning  problem  is  successful? 
How  can  we  compare  different  techniques?  We  have 
already  observed  that,  to  the  extent  that 
partitioning  is  a  "well-defined"  problem  at  all,  it 
has  a  large  number  of  alternative  formulations  and 
parameterizations.  Thus,  a  technique  that  is 
dominant  under  one  set  of  conditions  may  be 
inferior  under  a  different  parameterization.
Nevertheless,  any  evaluation  procedure  must  be  based  on
the  following  considerations: 

*  Is  there  a  known  "correct"  answer  (e.g.,
because  of  the  way  the  curves  were
constructed)?

*  Is  the  problem  formulated  in  such  a  way
that  there  is  a  "provably"  correct  answer?

*  How  good  is  the  agreement  of  the
partitioned  data  with  the  descriptive
vocabulary  (models)  in  which  the
"explanation"  is  posed?

*  How  good  is  the  agreement  with  (generic  or
"expert")  subjective  human  judgment?

*  What  is  the  trade-off  between  "false-alarms
and  misses"  in  the  placement  of
partition  points?  To  the  extent  that  it  is
not  possible  to  ensure  a  perfect  answer  (in
the  placement  of  the  partition  points),
there  is  no  way  to  avoid  such  a  trade-off.
Even  if  the  relative  weighting  between
these  two  types  of  errors  is  not  made 
explicit,  it  is  inherent  in  any  decision 
procedure  —  including  the  use  of 
subjective  human  judgment. 

In  spite  of  all  of  the  previous  discussion  in 
this  section,  it  might  still  be  argued  that  if  we 
take  the  union  of  all  partition  points  obtained  for 
all  reasonable  definitions  and  parameterizations  of 
the  partition  problem,  we  would  still  end  up  with  a 
"small"  set  of  partition  points  for  any  given 
curve,  and  further,  there  may  be  a  generic 
procedure  for  obtaining  this  covering  set.  While  a 
full  discussion  of  this  possibility  is  not
feasible  here,  we  can  construct  a  counterexample  to
the  unqualified  conjecture  based  on  selecting  a
very  high  ratio  of  the  cost  of  a  miss  to  that  of  a
false-alarm  in  selecting  the  partition  points.  A  (weak)
refutation  can  also  be  based  on  the  observation 
that  if  a  generic  covering  set  of  partition  points 
exists,  then  there  should  be  a  relatively 
consistent  way  of  ordering  all  the  points  on  a 
given  curve  as  to  their  being  acceptable  partition 
points;  the  experiment  presented  in  Figure  1
indicates  that,  in  general,  such  a  consistent
ordering  does  not  exist. 


III  PARADIGMS  FOR  CURVE  PARTITIONING

Almost  all  algorithms  employed  for  curve 
partitioning  appear  to  be  special  cases 
(instantiations)  of  one  or  more  of  the  following 
paradigms:

*  Local  Detection  of  Distinguished  Points:  a 
partition  point  is  inserted  at  locations 
along  the  curve  at  which  one  or  more  of  the 
descriptive  attributes  (e.g.,  curvature, 
distance  from  a  coordinate  axis  or 
centroid)  is  determined  to  have  a 
discontinuity,  an  extreme  value  (maxima  or 
minima),  or  a  zero  value  separating 
intervals  of  positive  and  negative  values. 

*  Best  Global  Description:  a  set  of  partition
points  is  inserted  at  those  locations  along
a  curve  that  allow  the  "best"  description
of  the  associated  segments  in  terms  of  some
a  priori  set  of  models  (e.g.,  the  set  of 
models  might  consist  of  all  first  and 
second  degree  polynomials,  with  only  one 
model  permitted  to  explain  the  data  between 
two  adjacent  partition  points;  the  quality 
of  the  description  might  be  measured  by  the 
mean  square  deviation  of  the  data  points 
from  the  fitting  polynomials). 

*  Confirming  Evidence:  given  a  number  of 
"independent"  procedures  (or  possibly 
different  parameterizations  of  a  given 
procedure)  for  locating  potential  partition 
points,  we  retain  only  those  partition 
points  that  are  common  to  some  subset  of 
the  different  procedures  or  their 
parameterizations. 

*  Recursive  Simplification:  the  input  data  is
subjected  to  repeated  applications  of  some
transformation  that  monotonically  reduces
some  measurable  aspect  of  the  data  to  one
of  a  finite  number  of  terminal  states
(e.g.,  differentiation,  smoothing,
projection,  thresholding).  The  hierarchy
of  data  sets  thus  produced  is  then 
processed  with  an  algorithm  derived  from 
the  previous  three  paradigms. 
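
The Best Global Description paradigm can be illustrated with a minimal sketch. The simplifications here are ours and are not taken from any program listing: the curve is assumed to be given as samples y(x), the model set is restricted to first-degree polynomials, exactly one partition point is sought, and deviation is measured vertically rather than perpendicular to the fitted line. The function names are hypothetical.

```python
def line_fit_sse(xs, ys):
    """Sum of squared vertical deviations from the least-squares line."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx > 0 else 0.0   # slope (0 for a degenerate segment)
    b = my - a * mx                     # intercept
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def best_single_breakpoint(xs, ys, min_len=3):
    """Return (index, mean square deviation) of the best single partition point.

    Each of the two segments, which share the partition point, is explained
    by exactly one straight-line model.
    """
    n = len(xs)
    best_k, best_cost = None, float("inf")
    for k in range(min_len - 1, n - min_len + 1):
        cost = line_fit_sse(xs[:k + 1], ys[:k + 1]) + line_fit_sse(xs[k:], ys[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost / n


# Example: a "V" shaped curve whose corner lies at sample index 5.
xs = list(range(11))
ys = [abs(x - 5) for x in xs]
corner, msd = best_single_breakpoint(xs, ys)
```

Extending the sketch to several partition points or to second-degree models changes only the search and the fitting routine; the quality measure (mean square deviation) stays the same.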


IV  PRINCIPLES  OF  EFFECTIVE  (ROBUST) 
MODEL-BASED  INTERPRETATION 

What  underlies  our  choice  of  partitioning 
criteria?  We  assert  that  any  competent 
partitioning  technique,  regardless  of  which  of  the 
above  paradigms  is  employed,  will  incorporate  the 
following  principles. 


A.  Stability 

The  "principle  of  stability"  is  the  assertion
that  any  valid  perceptual  decision  should  be  stable 
under  at  least  small  perturbations  of  both  the 
imaging  conditions  and  the  decision  algorithm 
parameters.  This  generalization  of  the  assumption 
of  "general  position"  also  subsumes  the  assertion 
(often  presented  as  an  assumption)  that  most  of  a 
scene  must  be  describable  in  terms  of  continuous 
variables  if  meaningful  interpretation  is  to  be 
possible.

It  is  interesting  to  observe  that  many  of  the 
constructs  in  mathematics  (e.g.,  the  derivative) 
are  based  on  the  concepts  of  convergence  and  limit, 
also  subsumed  under  the  stability  principle. 
Attempts  to  measure  the  digital  counterparts  of  the 
mathematical  concepts  have  traditionally  employed 
window  type  "operators"  that  are  not  based  on  a
limiting  process;  it  should  come  as  no  surprise 
that  such  attempts  have  not  been  very  effective. 

In  practice,  if  we  perturb  the  various  imaging 
and  decision  parameters,  we  observe  relatively 
stable  decision  regions  separated  by  obviously 
unstable  intervals  (e.g.,  the  two  distinct  percepts
produced  by  a  Necker  cube).  The  stable  regions 
represent  alternative  hypotheses  that  generally 
cannot  be  resolved  without  recourse  to  either 
additional  and  more  restrictive  assumptions,  or 
semantic  (domain-specific)  knowledge. 

B.  Complete,  Concise,  and  Complexity-Limited  Explanation

The  decision-making  process  in  image 
interpretation,  i.e.,  matching  image-derived  data
to  a  priori  models,  not  only  must  be  stable,  but 
must  also  explain  all  the  structure  observable  in 
the  data.  Equally  important,  the  explanation  must 
satisfy  specific  criteria  for  believability  and
complexity.  Believability  is  largely  a  matter  of
offering  the  simplest  possible  description  of  the
data  and,  in  addition,  explaining  any  deviation  of
the  data  from  the  models  (vocabulary)  used  in  the
description.  Even  the  simplest  description,
however,  must  also  be  of  limited  complexity;
otherwise  it  will  not  be  understandable  and  thus
not  believable. 

By  making  the  foregoing  principles  explicit, 
we  can  directly  invoke  them  (as  demonstrated  in  the 
following  section)  to  formulate  effective 
algorithms  for  perceptual  organization. 


V  INSTANTIATION  OF  THE  THEORY:  SPECIFIC 
TECHNIQUES  FOR  CURVE  PARTITIONING 

In  this  section  we  offer  two  effective  new 
algorithms  for  curve  partitioning  (program  listings 
available  from  the  authors).  In  each  case,  we 
first  describe  the  algorithm,  and  later
indicate  how  it  was  motivated  and  constrained  by
the  principles  just  presented.  In  both  algorithms, 
the  key  ideas  are:  (1)  to  view  each  point,  or 
segment  of  a  curve,  from  as  many  perspectives  as 
possible,  retaining  only  those  partition  points
receiving  the  highest  level  of  multiple
confirmation;  and  (2)  inhibiting  the  further 
selection  of  partition  points  when  the  density  of 
points  already  selected  exceeds  a  preselected  or 
computed  limit. 


A.  Curve  Partitioning  Based  on  Detecting  Local  Discontinuity

In  this  sub-section  we  present  a  new  approach 
to  the  problem  of  finding  points  of  discontinuity 
("critical  points")  on  a  curve.  Our  criterion  for 
success  is  whether  we  can  match  the  performance  of 
human  subjects  given  the  same  task  (e.g.,  see 
Figure  1).  The  importance  of  this  problem  from  the
standpoint  of  the  psychology  of  human  vision  dates
back  to  the  work  of  Attneave  [1954].  However,  it
has  long  been  recognized  as  a  very  difficult 
problem,  and  no  satisfactory  computer  algorithm 
currently  exists  for  this  purpose.  An  excellent 
discussion  of  the  problem  may  be  found  in  Davis
[1977];  other  pertinent  references  include 
Rosenfeld  [1975],  Freeman  [1977],  Kruse  [1978],  and 
Pavlidis  [1980].  Results  and  observations  akin  and 
complementary  to  those  presented  here  can  be  found 
in  Hoffman  [1982]  and  in  Witkin  [1983]. 

Most  approaches  equate  the  search  for  critical 
points  with  looking  for  points  of  high  curvature. 
Although  this  intuition  seems  to  be  correct,  it  is
incomplete  as  stated  (i.e.,  it  does  not  explicitly 
take  into  account  "explanation"  complexity); 
further,  the  methods  proposed  for  measuring 
curvature  are  often  inadequate  in  their  selection 
of  stability  criteria.  In  Figure  2  we  show  some 
results  of  measuring  curvature  using  discrete 
approximations  to  the  mathematical  definition. 

We  have  developed  an  algorithm  for  locating 
critical  points  that  invokes  a  model  related  to, 
but  distinct  from,  the  mathematical  concept  of 
curvature.  The  algorithm  labels  each  point  on  a 
curve  as  belonging  to  one  of  three  categories: 
(a)  a  point  in  a  smooth  interval,  (b)  a  critical 
point,  or  (c)  a  point  in  a  noisy  interval.  To  make 
this  choice,  the  algorithm  analyzes  the  deviations 
of  the  curve  from  a  chord  or  "stick"  that  is
iteratively  advanced  along  the  curve  (this  will  be 
done  for  a  variety  of  lengths,  which  is  analogous 
to  analyzing  the  curve  at  different  resolutions). 
If  the  curve  stays  close  to  the  chord,  points  in 
the  interval  spanned  by  the  chord  will  be  labeled 
as  belonging  to  a  smooth  section.  If  the  curve 
makes  a  single  excursion  away  from  the  chord,  the 
point  in  the  interval  that  is  farthest  from  the 
chord  will  be  labeled  a  critical  point  (actually, 
for  each  placement  of  the  chord,  an  accumulator 
associated  with  the  farthest  point  will  be 
incremented  by  the  distance  between  the  point  and 
the  chord).  If  the  curve  makes  two  or  more 
excursions,  points  in  the  interval  will  be  labeled 
as  noise  points. 
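
The labeling rule just described can be sketched as follows. This is an illustrative reading of the text, not the authors' program: the stick is taken to span a fixed number of samples rather than a fixed arc length, an "excursion" is taken to be a maximal run of consecutive samples lying farther than `width` from the chord on one side, and all function names are hypothetical.

```python
import math


def chord_distances(points, i, j):
    """Signed perpendicular distances of points[i..j] from the chord i-j."""
    (x0, y0), (x1, y1) = points[i], points[j]
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    if length == 0.0:
        # Degenerate chord: fall back to plain distance from the end point.
        return [math.hypot(x - x0, y - y0) for x, y in points[i:j + 1]]
    return [((x - x0) * dy - (y - y0) * dx) / length for x, y in points[i:j + 1]]


def count_excursions(dists, width):
    """Count runs of samples that leave the box of half-width `width`."""
    count, prev = 0, 0
    for d in dists:
        state = 0 if abs(d) <= width else (1 if d > 0 else -1)
        if state != 0 and state != prev:
            count += 1
        prev = state
    return count


def accumulate_votes(points, stick, width):
    """Advance the stick along the curve and vote for critical points.

    0 excursions: smooth interval, no vote.
    1 excursion: the farthest point's accumulator grows by its distance.
    2 or more: noisy interval, no vote at this resolution.
    """
    votes = [0.0] * len(points)
    for i in range(len(points) - stick):
        d = chord_distances(points, i, i + stick)
        if count_excursions(d, width) == 1:
            k = max(range(len(d)), key=lambda t: abs(d[t]))
            votes[i + k] += abs(d[k])
    return votes


# Example: an "L" shaped curve with its corner at sample index 10.
curve = [(float(x), 0.0) for x in range(11)] + [(10.0, float(y)) for y in range(1, 11)]
votes = accumulate_votes(curve, stick=6, width=0.5)
```

Because every placement of the stick that spans the corner votes for the same sample, the accumulator at the corner grows while the accumulators along the straight sides stay at zero.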

We  should  note  here  that  "noisy"  intervals  at 
low  resolution  (large  chord  length)  will  have  many 
critical  points  at  higher  resolution  (small  chord 
length).  Figure  3  shows  examples  of  curve  segments 
and  their  classifications.  The  distance  from  a 
chord  that  defines  a  significant  excursion  (i.e., 
the  width  of  the  boxes  in  Figure  3)  is  a  function
of  the  expected  noise  along  the  curve  and  the
length  of  the  chord.


At  each  resolution  (i.e.,  stick  size),  the 
algorithm  orders  the  critical  points  according  to 
the  values  in  their  accumulators  and  selects  the 
best  ones  first.  To  avoid  setting  an  arbitrary
"goodness"  threshold  for  distinguishing  critical 
from  ordinary  points,  we  use  a  complexity 
criterion.  To  halt  the  selection  process,  we  stop 
when  the  points  being  suggested  are  too  close  to 
those  selected  previously  at  the  given  resolution. 
In  our  experiments  we  define  "too  close"  as  being 
within  a  quarter  of  the  stick  length  used  to 
suggest  the  point. 
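
A minimal sketch of this selection step, under assumptions of our own: separation is measured in sample indices rather than path length, "within a quarter" is taken to include equality, and `select_critical_points` is a hypothetical name.

```python
def select_critical_points(votes, stick):
    """Pick critical points in order of decreasing accumulator value.

    Selection halts as soon as a suggested point lies within a quarter of
    the stick length of a point already selected at this resolution, so no
    explicit "goodness" threshold on the accumulator values is needed.
    """
    candidates = [i for i, v in enumerate(votes) if v > 0]
    candidates.sort(key=lambda i: votes[i], reverse=True)
    chosen = []
    for i in candidates:
        if any(abs(i - c) <= stick / 4 for c in chosen):
            break  # too close to an earlier choice: stop selecting
        chosen.append(i)
    return chosen


# Example: three strong candidates plus a weak one at index 18.
votes = [0.0] * 20
votes[2], votes[10], votes[11], votes[18] = 5.0, 3.0, 2.5, 1.0
picked = select_critical_points(votes, stick=8)
```

In the example, the candidate at index 11 falls within two samples of the point already selected at index 10, so selection stops there and the weaker candidate at index 18 is never considered at this resolution.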


After  the  critical  points  have  been  selected
at  the  coarsest  resolution,  the  algorithm  is
applied  at  higher  resolutions  to  locate  additional
critical  points  that  are  outside  the  regions
dominated  by  previously  selected  points.  Figure  4a
shows  the  critical  points  determined  at  the  coarsest
level  (stick  length  of  100  pixels;  approximately 
1/10  of  the  length  of  the  curve).  Figure  4b  shows 
all  the  critical  points  labeled  with  the  stick 
lengths  used  to  determine  them.  (We  note  that  this 
critical  point  detection  procedure  does  not  locate 
inflection  points  or  smooth  transitions  between 
segments,  such  as  the  transition  from  an  arc  of  a 
circle  to  a  line  tangent  to  the  circle.) 

The  above  algorithm  appears  to  be  very 
effective,  especially  for  finding  obvious  partition 
points  and  in  not  making  "ugly"  mistakes  (i.e.,
choosing  partition  points  at  locations  that  none  of 
our  human  subjects  would  pick).  Its  ability  to 
find  good  partition  points  is  based  on  evaluating 
each  point  on  the  curve  from  multiple  viewpoints 
(placements  of  the  stick)  —  a  direct  application 
of  the  principle  of  stability.  Requiring  that  the 
partition  points  remain  stable  under  changes  in 
resolution  (i.e.,  small  changes  in  stick  length) 
did  not  appear  to  be  effective  and  was  not 
employed;  in  fact,  stick  length  was  altered  by  a 
significant  amount  in  each  iteration,  and  partition 
points  found  at  these  different  scales  of 
resolution  were  not  expected  to  support  each  other, 
but  were  assumed  to  be  due  to  distinct  phenomena. 

The  avoidance  of  ugly  mistakes  was  due  to  our 
method  of  limiting  the  number  of  partition  points 
that  could  be  selected  at  any  level  of  resolution, 
or  in  any  neighborhood  of  a  selected  point  (i.e., 
limiting  the  explanation  complexity).  One  concept 
we  invoked  here,  related  to  that  of  complete 
explanation,  was  that  the  detection  procedure  could 
not  be  trusted  to  provide  an  adequate  explanation 
when  more  than  a  single  critical  point  was  in  its 
field  of  view,  and  in  such  a  situation,  any 
decision  was  deferred  to  later  iterations  at  higher 
levels  of  resolution  (i.e.,  shorter  stick  lengths). 

Finally,  in  accord  with  our  previous 
discussion,  the  algorithm  has  two  free  parameters 
that  provide  control  over  its  definition  of  noise 
(i.e.,  variations  too  small  or  too  close  together 
to  be  of  interest),  and  its  willingness  to  miss  a 
good  partition  point  so  as  to  be  sure  it  does  not 
select  a  bad  one. 


B.  Curve  Partitioning  Based  on  Detecting  Process  Homogeneity

To  match  human  performance  in  partitioning  a
curve,  by  recognizing  those  locations  at  which  one
generating  process  terminates  and  another  begins,
is  orders  of  magnitude  more  difficult  than
partitioning  based  on  local  discontinuity  analysis. 
As  noted  earlier,  a  critical  aspect  of  such 
performance  is  the  size  and  effectiveness  of  the 
vocabulary  (of  a  priori  models)  employed. 
Explicitly  providing  a  general  purpose  vocabulary 
to  the  machine  would  entail  an  unreasonably  large 
amount  of  work  —  we  hypothesize  that  the  only 
effective  way  of  allowing  a  machine  to  acquire  such 
knowledge  is  to  provide  it  with  a  learning 
capability. 

For  our  purposes  in  this  investigation,  we 
chose  a  problem  in  which  the  relevant  vocabulary 
was  extremely  limited:  the  curves  to  be  partitioned 
are  composed  exclusively  of  straight  lines  and  arcs 
of  circles.  (Two  specific  applications  we  were 
interested  in  here  were  the  decomposition  of 
silhouettes  of  industrial  parts,  and  the 
decomposition  of  the  line  scans  returned  by  a 
"structured  light"  ranging  device  viewing  scenes 
containing  various  diameter  cylinders  and  planar 
faced  objects  lying  on  a  flat  surface.)  Our  goal 
here  was  to  develop  a  procedure  for  locating 
critical  points  along  a  curve  in  such  a  way  that 
the  segments  between  the  critical  points  would  be 
satisfactorily  modeled  by  either  a  straight-line 
segment  or  a  circular  arc.  Relevant  work 
addressing  this  problem  has  been  done  by  Montanari 
[1970],  Ramer  [1972],  Pavlidis  [1974],  Liao  [1981], 
and  Lowe  [1982].

Our  approach  is  to  analyze  several  "views"  of 
a  curve,  construct  a  list  of  possible  critical 
points,  and  then  select  the  optimum  points  between 
which  models  from  our  vocabulary  can  be  fitted. 
For  our  experiments  we  quantized  an  analytic  curve 
at  several  positions  and  orientations  (with  respect
to  a  pixel  grid),  then  attempted  to  recover  the
original  model. 

For  each  view  (quantization)  of  the  curve  we 
locate  occurrences  of  lines  and  arcs,  marking  their 
ends  as  prospective  partition  points.  This  is 
accomplished  by  randomly  selecting  small  seed 
segments  from  the  curve,  fitting  to  them  a  line  or 
arc,  examining  the  fit,  and  then  extending  as  far 
as  possible  those  models  that  exhibit  a  good  fit. 
After  a  large  number  of  seeds  have  been  explored  in 
the  different  views  of  the  curve,  the  histogram 
(frequency  count  as  a  function  of  path  length)  of 
beginnings  and  endings  is  used  to  suggest  critical
points  (in  order  of  their  frequency  of  occurrence). 
Each  new  critical  point,  considered  for  inclusion 
in  the  explanation  of  how  the  curve  is  constructed, 
introduces  two  new  segments  which  are  compared  to 
both  our  line  and  circle  models.  If  one  or  both  of 
the  segments  have  acceptable  fits,  the 
corresponding  curve  segments  are  marked  as 
explained.  Otherwise,  the  segments  are  left  to  be 
explained  by  additional  critical  points  and  the 
partitions  they  imply.  The  addition  of  critical 
points  continues  until  the  complete  curve  is 
explained.  Figure  5  shows  an  example  of  the 
operation  of  this  algorithm. 
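
The histogram step can be sketched as follows. The sketch assumes that the fitted lines and arcs from all views have already been reduced to (start, end) path-length pairs; the bin width and the name `rank_endpoints` are hypothetical, and the seed selection and model fitting themselves are not shown.

```python
from collections import Counter


def rank_endpoints(segments, bin_width=1.0):
    """Histogram segment beginnings and endings by path length.

    `segments` holds (start, end) path-length pairs for every line or arc
    found in any view of the curve. Returns (path_length, count) pairs,
    most frequent first, as suggested critical points.
    """
    counts = Counter()
    for start, end in segments:
        counts[int(start // bin_width)] += 1
        counts[int(end // bin_width)] += 1
    return [(b * bin_width, c) for b, c in counts.most_common()]


# Example: five fitted segments pooled from several quantizations of a curve.
found = [(0.0, 10.2), (0.3, 10.4), (10.1, 25.0), (10.8, 25.3), (0.1, 10.6)]
suggestions = rank_endpoints(found)
```

In the example, beginnings and endings pile up near path length 10, so that location is suggested first as a critical point.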


While  admittedly  operating  in  a  relatively 
simple  environment,  the  above  algorithm  exhibits 
excellent  performance.  This  is  true  even  in  the 
difficult  case  of  finding  partition  points  along 
the  smooth  interface  between  a  straight  line  and  a 
circle  to  which  the  line  is  tangent. 

Both  basic  principles,  stability  and  complete 
explanation,  are  deeply  embedded  in  this  algorithm. 
Retaining  only  those  partition  points  which  persist 
under  different  "viewpoints"  was  motivated  by  the 
principle  of  stability.  Our  technique  for 
evaluating  the  fit  of  the  segment  of  a  curve 
between  two  partition  points,  to  both  the  line  and 
circle  models,  requires  that  the  deviations  from  an 
acceptable  model  have  the  characteristics  of 
"white"  (random)  noise;  this  is  an  instantiation  of 
the  principle  of  complete  explanation,  and  is  based 
on  our  previous  work  presented  in  Bolles  [1982]. 
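
One simple way to check whether residuals resemble "white" noise is a runs test on their signs. The sketch below is a stand-in chosen for illustration; it is not claimed to be the test used in Bolles [1982]. The acceptance limit `z_limit` and the function names are assumptions.

```python
import math


def runs_z_score(residuals):
    """Wald-Wolfowitz runs statistic on the signs of the residuals.

    Too few runs (large negative z) indicates systematic deviation, as when
    a straight line is fitted to an arc; too many runs (large positive z)
    indicates an overly regular alternation.
    """
    signs = [r > 0 for r in residuals if r != 0]
    n1 = sum(signs)              # number of positive residuals
    n2 = len(signs) - n1         # number of negative residuals
    if n1 == 0 or n2 == 0:
        return float("-inf")     # every residual lies on one side of the model
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    if var == 0.0:
        return 0.0               # too few samples to decide
    return (runs - mean) / math.sqrt(var)


def looks_like_white_noise(residuals, z_limit=2.0):
    """Accept the model when the run count is close to its expected value."""
    return abs(runs_z_score(residuals)) <= z_limit


# Residuals with a systematic bow, as from fitting a line to an arc.
bowed = [-1.0] * 5 + [1.0] * 10 + [-1.0] * 5
accepted = looks_like_white_noise(bowed)
```

Under such a test, a segment whose residuals are small but systematically bowed is rejected, which forces an additional partition point or a different model.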


VI  DISCUSSION 

We  can  summarize  our  key  points  as  follows: 

*  The  partition  problem  does  not  have  a 
unique  definition,  but  is  parameterized 
with  respect  to  such  items  as  purpose,  data 
representation,  trade-off  between  different 
error  types  (false-alarms  vs  misses),  etc. 

*  Psychologically  acceptable  partitions  are 
associated  with  an  implied  explanation  that 
must  satisfy  criteria  for  accuracy, 
complexity,  and  believability.  These
criteria  can  be  formulated  in  terms  of  a 
set  of  principles,  which,  in  turn,  can 
guide  the  construction  of  effective 
partitioning  algorithms  (i.e.,  they  provide 
necessary  conditions). 

One  implication  contained  in  these 
observations  is  that  a  purely  mathematical 
definition  of  "intrinsic  structure"  (i.e.,  a 
definition  justified  solely  by  appeal  to 
mathematical  criteria  or  principles)  cannot,  by 
itself,  be  sufficiently  selective  to  serve  as  a 
basis  for  duplicating  human  performance  in  the 
partitioning  task;  generic  partitioning  (i.e., 
partitioning  in  the  absence  of  semantic  content)  is 
based  on  psychological  "laws"  and  physiological
mechanisms,  as  well  as  on  correlations  embedded  in 
the  data. 

In  this  paper  we  have  looked  at  a  very  limited 
subset  of  the  class  of  all  scene  partitioning 
problems;  nevertheless,  it  is  interesting  to 
speculate  on  how  the  human  performs  so  effectively 
in  the  broader  domain  of  interpreting  single  images 
of  natural  scenes.  The  speed  of  response  in  the 
human's  ability  to  interpret  a  sequence  of  images  of
dissimilar  scenes  makes  it  highly  questionable  that
there  is  some  mechanism  by  which  he  simultaneously
matches  all  his  semantic  primitives  against  the
imaged  data,  even  if  we  assume  that  some
independent  process  has  already  presented  him  with
a  "camera  model"  that  resolves  some  of  the
uncertainties  in  image  scale,  orientation,  and
projective  distortion.  How  does  the  human  index
into  the  large  semantic  data  base  to  find  the
appropriate  models  for  the  scene  at  hand? 

Consider  the  following  paradigm:  first  a  set 
of  coherent  components  is  recovered  from  the  image
on  the  basis  of  very  general  (but  parameterized) 
clustering  criteria  of  the  type  described  earlier; 
next,  a  relatively  small  set  of  semantic  models, 
which  are  components  of  many  of  the  objects  in  the
complete  semantic  vocabulary,  are  matched  against 
the  extracted  clusters;  successful  matches  are  then 
used  to  index  into  the  full  data  base  and  the 
corresponding  entries  are  matched  against  both  the 
extracted  clusters  and  adjacent  scene  components; 
these  additional  successful  matches  will  now 
trigger  both  iconic  and  symbolic  associations  that 
result  in  further  matching  possibilities  as  well  as 
perceptual  hypotheses  that  organize  large  portions 
of  the  image  into  coherent  structures  (gestalt 
phenomena).

If  this  paradigm  is  valid,  then,  even  though 
much  of  the  perceptual  process  would  depend  on  an 
individual's  personal  experience  and  immediate 
goals,  we  might  still  expect  "hard  wired" 
algorithms  (genetically  programmed,  but  with 
adjustable  parameters)  to  be  employed  in  the 
initial  partitioning  steps.

In  this  paper,  we  have  attempted  to  give 
computational  definitions  to  some  of  the  organizing 
criteria  needed  to  approach  human  level  performance 
in  the  partitioning  task.  However,  we  believe  that 
our  more  important  contribution  has  been  the
explicit  formulation  of  a  set  of  principles  that  we 
assert  must  be  satisfied  by  any  effective  procedure 
for  perceptual  grouping. 
TASK  1:  Select  AT  MOST  5  points  to  describe  this  line  drawing  so  that
you  will  be  able  to  reconstruct  it  as  well  as  possible  10  years 
from  now,  given  just  the  sequence  of  selected  points. 

Since  five  points  were  sufficient  to  form  an  approximate  convex  hull
of  the  figure,  virtually  everyone  did  so,  selecting  the  5  points  shown  below. 


TASK  2:  Assume  that  a  friend  of  yours  is  going  to  be  asked  to  recognize 
this  line  drawing  on  the  basis  of  the  information  you  supply  him 
about  it.  He  will  be  presented  with  a  set  of  drawings,  one  of 
which  will  be  a  rotated  and  scaled  version  of  this  curve.  You  are 
only  allowed  to  provide  him  with  A  SEQUENCE  OF  AT  MOST
5  POINTS.  Mark  the  points  you  would  select. 

Since  5  points  were  not  enough  to  outline  all  the  key  features  of  the 
figure,  the  subjects  had  to  decide  what  to  leave  out.  They  seemed  to  adopt
one  of  two  general  strategies:  (a)  use  the  limited  number  of  points  to  describe
one  distinct  feature  well  (illustrated  by  the  selection  on  the  left),  or  (b)  use 
the  points  to  outline  the  basic  shape  of  the  figure  (shown  on  the  right). 

TASK  3:  This  line  drawing  was  constructed  by  piecing  together  segments
produced  by  different  processes.  Please  indicate  where  you  think 
the  junctions  between  segments  occur  AND  VERY  BRIEFLY 
DESCRIBE  EACH  SEGMENT.  Use  as  few  points  as  possible, 
but  no  more  than  5. 

The  constraint  of  being  limited  to  5  points  forced  the  subjects  to  consider
the  whole  curve  and  develop  a  consistent,  global  explanation.  The
basic  strategy  seemed  to  be  a  recursive  one  in  which  they  first  partitioned  the
curve  into  2  segments  by  placing  a  breakpoint  at  position  1  and  another  one
at  either  position  2  or  position  3  to  separate  the  smooth  curves  from  the
sharp  corners.  Then  they  used  the  remaining  points  to  subdivide  these  segments
according  to  a  vocabulary  they  selected  that  included  such  things  as
triangles,  rectangles,  and  sinusoids.  For  example,  almost  everyone  placed
breakpoints  at  positions  3  and  4  and  described  the  enclosed  segment  as  part
of  a  triangle.  Similarly  the  segment  between  positions  1  and  5  was  generally
described  as  a  decaying  sinusoid.  It  is  interesting  to  note  that  in  task  1  the
subjects  consistently  placed  a  point  close  to  position  5  but  always  farther  to
the  right,  because  they  were  trying  to  approximate  a  convex  hull.  The  different
purposes  led  to  different  placements.


(a) This figure shows the results of applying the "improved angle detection"
procedure described in Rosenfeld [1975] to a digitized version of the
curve in Figure 1. The procedure works quite well, except for the
introduction of a breakpoint in the middle of the right side and the merging
of two small bumps at the right of the sinusoidal segment.


(b) However, if we extract a portion of the curve and apply the algorithm,
it introduces several additional breakpoints because the change in curve
length causes some of the algorithm parameters to change.

FIGURE  2  ESTIMATION  OF  CURVATURE  FROM 
DISCRETE  APPROXIMATIONS 


FIGURE  1  EXPERIMENTS  IN  WHICH  HUMAN  SUBJECTS 
WERE  ASKED  TO  SEGMENT  A  CURVE 


FIGURE 3 EXAMPLE CURVE SEGMENTS AND
THEIR CLASSIFICATIONS

(c) Additional examples of the local discontinuity analysis


FIGURE  4  LOCAL  DISCONTINUITY  PARTITIONING 




(a)  An  analytic  curve  consisting  of  two  straight  segments 
and  a  circular  arc 


(b) A multiply explained segment of the curve formed by
the extension of the arc and one of the line segments
to include as many compatible points as possible

Abstract

In  this  paper  we  present  an  implementation  of 
hierarchical  scene  matching  in  the  VISIONS  image 
processing  cone  -  a  pyramidal  processing  architecture.  The 
problem  of  scene  matching  is  common  to  many  applications 
in  machine  vision  including  registration,  motion  detection, 
and  stereo  vision.  Scene  matching  by  feature  correlation 
can  solve  this  problem  but  suffers  from  computational 
expense  and  failure  in  highly  textured  images.  Hierarchical 
correlation  provides  both  a  cheaper  matching  algorithm  and 
a coarse-to-fine matching strategy that overcomes "textural"
problems by matching on gross image structures first.
These  methods  fit  naturally  into  the  processing  cone  or 
pyramid  architectures  that  have  been  proposed  for  image 
processing.  We  present  a  discussion  of  the  architecture  of 
the  processing  cone,  the  construction  of  image  pyramids, 
and  the  use  of  these  pyramids  in  hierarchical  correlation. 
A set of experiments illustrates the operation of these ideas.


High  frequency  texture  in  the  image  may  provide  enough 
repetitive  pattern  that  a  number  of  different  matches  may 
be  considered  equally  valid  by  the  correlation  technique. 
On  the  other  hand  the  lowest  frequencies  may  be  due  to 
illumination  differences  and  may  bias  the  correlation 
measure  away  from  the  veridical  match.  Different  kinds  of 
normalizations  can  help  overcome  this  problem,  but  only  at 
greater  computational  cost  [6].  In  order  to  avoid  false 
matches  it  may  be  necessary  to  use  large  sample  windows, 
which  also  increases  this  cost. 

The  hierarchical  matching  technique  described  here 
overcomes  both  of  these  problems.  First,  the  matching  is 
done  initially  based  on  the  larger  structures  in  the  images 
(since  they  become  prominent  at  low  frequencies),  providing 
ball-park  estimates  for  matching  higher  frequency 
information  at  levels  below.  This  overcomes  the  problems 
due to high frequency textures. Secondly, the coarse-fine
strategy restricts the search to 3 x 3 areas at each level,
significantly reducing the computational cost.


0.0  Introduction 

The  problem  of  matching  digital  images  by  correlation 
techniques  is  an  important  and  well  known  problem  in 
computer  vision  and  pattern  recognition.  It  has  applications 
in  image  registration,  object  detection  by  template  matching, 
and  motion  and  stereo  analysis.  This  paper  describes  a 
hierarchical  approach  to  this  problem,  and  includes  some 
results  on  noisy  real  world  images. 


Basically, the technique consists of matching band-passed
versions of the images at different levels of resolution. The
filters applied approximate convolution with ∇²G operators
of different sizes (see footnote 1). The size of the Gaussian
increases as the resolution becomes coarser, in such a way as to
limit the frequency content in the image to avoid aliasing due to
the sampling rate at each level of resolution. The
elimination of low frequencies in the image helps overcome
any problems due to illumination and scaling differences.


While there have been a number of studies of correlation
techniques for matching images [1,14], the two basic problems
regarding computational costs and false matches have not
been solved. If the image displacements are limited to be
less than D pixels in either direction, then there are
(2D + 1)^2 possible test locations for matching at each pixel.
The problem of false matches may arise for several reasons.


While this sort of approach to improving matching has
been discussed by other authors [15,17,12,9,3], to our
knowledge only Wong and Hall have applied it to real
images and studied the issues in doing so. Our technique,
which more closely resembles the one outlined by Burt [3],
differs significantly from the approach of Wong and Hall.


This research was supported in part by DARPA under
Grant N00014-82-K-0464.


1. The ∇²G (read del-two-g) operator is the Laplacian
applied to the result of convolving with a Gaussian.

1.0 The Processing Cone Structure and Image Pyramids

In  our  hierarchical  algorithm  images  are  represented  at 
varying  levels  of  spatial  resolution.  Coarse  resolution 
images  are  obtained  by  low  pass  filtering  and  sub-sampling. 
Normally  the  filtering  is  done  by  convolution  smoothing 
with  Gaussian-like  kernels.  This  low-pass  filtering  allows  us 
to sub-sample these images and store them in coarse grids.

1.1 The processing cone

The  natural  architecture  for  these  image  pyramids  is  the 
processing  cone  [8],  a  multilayer,  multiresolution  organization 
of  image  planes  upon  which  inter-  and  intra-layer  image 
operators are applied (see Figure 1). Operations which
produce coarse images from finer ones are called reductions,
while  those  that  produce  finer  resolution  images  from  coarse 
ones  are  called  projections.  This  hierarchical  data  structure 
has  also  appeared  as  pyramids  in  [’')].  Similar 
multiresolution  representations  have  been  used  before  in 
scene matching [17,13,12].


Figure  1:  The  processing  cone. 

This  parallel  array  computer  is  hierarchically  organized 
into  layers  of  decreasing  spatial  resolution.  Information 
within  the  cone  is  transformed  by  means  of  functions 
operating  on  local  windows  of  data.  Cone  algorithms 
are  specified  as  sequences  of  these  parallel  functions 
applied  in  one  of  three  processing  modes:  reduction  (up 
the  cone),  projection  (down  the  cone),  and  iteration  (at 
the  same  level). 

The processing cone is composed of levels 0 to L, level l
being 2^l pixels on a side. When we place two
adjacent levels in registration, each coarse pixel overlays
four finer pixels. We will call these pixels fathers and sons
respectively. These pixels cover the same square area of
the image space. Going from one level to the next coarser
one, we get a four to one reduction in data. The highest spatial
frequency that can be represented at a coarse level
(corresponding to the Nyquist rate) is half that which can
be represented at the next finer level. Thus for each step
up the cone (coarsening) the spatial frequency bandwidth
(relative to the finest grid) is cut in half.

1.2  Low  pass  pyramids 

When we fill the cone with reduced resolution copies of
the image by subsampling, low pass filtering must be done
to prevent aliasing. Aliasing occurs when the one-of-four
subsampling that gives the next coarser level produces
spurious image components for any spatial frequency in the
upper half of the frequency spectrum of the image being
sampled. The simplest low pass filter we can use is
obtained by taking the average of the four sons of a coarse
pixel (as in [16] and [17]).

A slightly more complicated family of reductions has
been proposed by Burt [2]. He has approached the
inter-level low-pass filter design by considering the net
convolution obtained when a given reduction is applied at
progressively higher (coarser) levels. For example, the
two-by-two averaging reduction when applied twice is
equivalent to a four-by-four averaging reduction up two
levels in the cone. Continued application of two-by-two
averaging up the cone always results in "flat" equivalent
convolution masks (i.e., unweighted averaging). Burt shows
how, by using slightly larger four-by-four kernels, the
equivalent convolution masks can be made to approximate
Gaussian-like low-pass filters. We have used the following
such kernel in most of our experiments:
[1 3 3 1] x [1 3 3 1]^t (the notation is explained in the
footnote). The result of applying this 4 x 4
operator on an image appears in Figure 2.
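
The reduction step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the cone implementation: the function names `reduce_level` and `low_pass_pyramid`, the normalisation by 64 (the sum of the kernel weights), the placement of the 4 x 4 footprint so that it is centred on the four sons, and the clamping of indices at the image border are all assumptions that the text does not specify. Note also that the returned list is indexed from the finest level upward, which is the reverse of the level numbering used in the paper.

```python
def reduce_level(fine):
    """One reduction step with the separable kernel [1 3 3 1] x [1 3 3 1]^t.

    `fine` is a square list of lists with an even side length. The weights
    are divided by 64 so that a constant image stays constant (assumption).
    """
    w = (1, 3, 3, 1)
    n = len(fine)
    half = n // 2

    def clamp(k):
        # Border handling by index clamping (an assumption of this sketch).
        return min(max(k, 0), n - 1)

    coarse = []
    for i in range(half):
        row = []
        for j in range(half):
            acc = 0.0
            # The 4 x 4 footprint is centred on the 2 x 2 block of sons.
            for a in range(4):
                for b in range(4):
                    acc += w[a] * w[b] * fine[clamp(2 * i - 1 + a)][clamp(2 * j - 1 + b)]
            row.append(acc / 64.0)
        coarse.append(row)
    return coarse


def low_pass_pyramid(image):
    """Return [finest, ..., 1 x 1] by repeated reduction."""
    levels = [image]
    while len(levels[-1]) > 1:
        levels.append(reduce_level(levels[-1]))
    return levels
```

Because the kernel is an outer product, the same result could be obtained with two one-dimensional passes; the direct double loop is used here only to keep the footprint explicit.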


Figure  2:  Low  pass  pyramid. 

Levels  4  through  7  of  the  low  pass  pyramid  obtained 
from  the  mandrill  eye  image  by  applying  the  4x4 
reduction operator [1 3 3 1] x [1 3 3 1]^t.

1. [ · · · ] is a column vector, 'x' is the outer product
operation, and 't' is the transpose operator.

1.3 Band pass pyramids


2.0  Correlation  Matching 


In the next section we will see that correlation matching
is better performed when low spatial frequencies (relative to
the grid size) have been filtered out. Such high-pass
filtering performed at each level of a low pass pyramid
effectively produces a band-passed image at each level.

A good choice for this high-pass filtering is a discrete
Laplacian such as

     0   1   0            1/4  1/2  1/4
     1  -4   1     or     1/2  -3   1/2
     0   1   0            1/4  1/2  1/4

Such a mask can be applied to each level of the low-pass pyramid.
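
As a small illustration of this first method, the sketch below applies a 3 x 3 Laplacian mask to every level of a low-pass pyramid. The names `LAPLACIAN_4`, `LAPLACIAN_8`, `apply_mask` and `band_pass_pyramid` are hypothetical, the zero corner weights of the first mask are the usual reading of that mask rather than something the scanned text shows, and border pixels are handled by index clamping, which is an assumption.

```python
LAPLACIAN_4 = ((0.0, 1.0, 0.0),
               (1.0, -4.0, 1.0),
               (0.0, 1.0, 0.0))

LAPLACIAN_8 = ((0.25, 0.5, 0.25),
               (0.5, -3.0, 0.5),
               (0.25, 0.5, 0.25))


def apply_mask(image, mask):
    """Apply a 3 x 3 mask to a list-of-lists image, clamping at the border."""
    rows, cols = len(image), len(image[0])

    def clamp(k, n):
        return min(max(k, 0), n - 1)

    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += mask[dr + 1][dc + 1] * image[clamp(r + dr, rows)][clamp(c + dc, cols)]
            out_row.append(acc)
        out.append(out_row)
    return out


def band_pass_pyramid(low_pass_levels, mask=LAPLACIAN_8):
    """High-pass filter each low-pass level, giving one band-passed image per level."""
    return [apply_mask(level, mask) for level in low_pass_levels]
```

Both masks have weights that sum to zero, so constant regions (and, away from the border, linear intensity ramps) produce a zero response; this is the sense in which the low frequencies are removed.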

A second method for generating a band-pass pyramid is Burt's
Laplacian pyramid [3,5]. Here the fact that a ∇²G can be
approximated by a difference of Gaussians is used to
effectively compute band-pass filters by differencing adjacent
levels of a Gaussian (low pass) pyramid. The difference is
taken between the finer level and an appropriate projection
of the coarser level. Figure 3 shows the Laplacian pyramid
derived from the Gaussian pyramid in Figure 2.

Finally, a third method for computing a band-pass
pyramid is to perform a Laplacian at the finest level and
then use the high-pass output as the base of a low-pass
pyramid. This method is used in the optic fundus image
experiments in section 5.2. Some of our recent theoretical
work has suggested that this method has an aliasing
problem, but we have not yet seen it in our experiments.


Figure  3:  Band  pass  pyramid. 

Levels  4  through  7  of  the  band  pass  pyramid  obtained 
from  the  low  pass  pyramid  in  Figure  2. 


In correlation matching a sample window about a point in
one image is compared to trial windows in the second
image. The point in the second image whose trial window
gives the optimal correlation value is chosen as the match
point. This can be considered a special case of the more
general method of feature matching. An example of
alternative features is the edges that were used in Marr
and Poggio's matching algorithm [12].

There are more than a few variations of the basic
correlation measure (e.g., see [6]). The basic correlation from
which the family of measures derives its name is the sum
of the pairwise product of corresponding pixels in two
windows. Common variations include 1) mean normalized
correlation, in which the mean of the values in each
window is subtracted from each value; 2) variance
normalized correlation, in which the correlation sum is
divided by the variance of the two windows; 3) sum of
squared differences; and 4) sum of the magnitude of
differences.
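
The measures listed above can be written down compactly for two windows given as flat lists of pixel values. The sketch below is illustrative only, and the function names are hypothetical. In particular, the text says only that the correlation sum is "divided by the variance of the two windows"; the normalisation used here (mean removal followed by division by the geometric mean of the two windows' sums of squared deviations) is one common reading of that phrase and should be treated as an assumption.

```python
import math


def basic_correlation(a, b):
    """Sum of the pairwise products of corresponding pixels."""
    return sum(x * y for x, y in zip(a, b))


def _centered(a):
    mean = sum(a) / len(a)
    return [x - mean for x in a]


def mean_normalized_correlation(a, b):
    """Basic correlation after subtracting each window's own mean."""
    return basic_correlation(_centered(a), _centered(b))


def variance_normalized_correlation(a, b):
    """Mean-removed correlation scaled to lie in [-1, 1] (assumed normalisation)."""
    ca, cb = _centered(a), _centered(b)
    denom = math.sqrt(basic_correlation(ca, ca) * basic_correlation(cb, cb))
    return basic_correlation(ca, cb) / denom if denom > 0 else 0.0


def sum_squared_differences(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sum_abs_differences(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))
```

The first three measures are maximised at the best match, whereas the two difference measures are minimised there, so a search loop has to pick the optimum accordingly.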

Mean normalized correlation is equivalent to basic
correlation performed after a specific pre-filtering of the
two images. The filter (or convolution mask) used in this
case is the difference of a "flat" averaging mask and the
identity mask (a discrete impulse). As such it can be
considered as a high-pass filter and is thus related to other
high pass filters such as the discrete Laplacian. However,
the frequency response of the subtract-local-mean filter is
not as flat at high frequencies as that of discrete
Laplacians. For this reason, we are led to consider
Laplacian pre-filtered basic correlation as a substitute for
mean normalized correlation.

In our experiments we have used 8 x 8 sample windows.
This choice is intended to provide a tradeoff between small
windows, which are more immune to occlusion and
distortion problems, and large windows, which capture a
large amount of matchable structure. We have not yet
experimented with other sample window sizes.

3.0  Search  Strategy 

One way of looking at matching is as a process of
searching for the point that optimizes some measure of
similarity. In our case the measure is the local correlation
measure between the two images. The strategy adopted for
searching for the point of match (i.e., the point where the
measure is maximized) should not only attempt to decrease
the number of false matches, but also reduce the
computational cost involved in searching.

3.1  3  by  3  Search 

The search strategy adopted in our process begins at a
coarse level where the maximum displacement is within one
pixel in both directions (see Figure 4). Let this be level
L_d. The search is conducted in a 3 x 3 area around the
point of interest at level L_d on the band-pass images at that
level. The resulting displacements at this level (along each
axis) are either -1, 0, or 1. The value obtained here is
within 1/2 pixel of the correct displacement at this level.


At each level below (say L_j), the displacement values for
a given point are projected down from its father pixel in
the next level above. Due to the doubling of resolution
(thus halving the pixel width) this value is double the
value of the father pixel. This establishes the displacements
within one pixel accuracy in either direction at this level
(L_j). Searching in a 3 x 3 area at this level refines the
displacement to within 1/2 pixel accuracy at this level. The
process is repeated up to and including the finest level of
resolution of the image.
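
The arithmetic behind this coarse-to-fine refinement can be traced for a single displacement component. The sketch below is an idealisation, not the matching algorithm itself: it assumes a perfect correlation search at every level, modelled by rounding the true displacement (expressed in that level's pixel units) to the nearest integer, and it checks that the correction needed after doubling the father's value always lies in {-1, 0, 1}, i.e. inside the 3 x 3 search area. The names `nearest` and `coarse_to_fine` are hypothetical.

```python
import math


def nearest(x):
    """Round to the nearest integer (halves round up)."""
    return math.floor(x + 0.5)


def coarse_to_fine(true_disp, levels_above):
    """Trace one displacement component from level L_d down to the finest level.

    `true_disp` is the displacement in finest-level pixels and `levels_above`
    is the number of levels between L_d and the finest level.
    """
    estimate = nearest(true_disp / 2 ** levels_above)
    if abs(estimate) > 1:
        raise ValueError("starting level is too fine for this displacement")
    trace = [estimate]
    for k in range(levels_above - 1, -1, -1):
        projected = 2 * estimate               # father's value, doubled
        target = nearest(true_disp / 2 ** k)   # idealised search result
        correction = target - projected
        if correction not in (-1, 0, 1):
            raise AssertionError("correction fell outside the 3-wide search")
        estimate = projected + correction
        trace.append(estimate)
    return trace


if __name__ == "__main__":
    # Row and column components of a (-5, 7) displacement, three levels up.
    print(coarse_to_fine(-5, 3))
    print(coarse_to_fine(7, 3))
```

The check inside the loop never fails for an ideal search: an estimate within 1/2 pixel at the father level is within 1 pixel after doubling, so the nearest integer at the finer level is at most one step away.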


(a) Displacement vector at level N
(b) Displacement vector projected to its four sons
    at level N+1 (only one of the four sons is shown)
(c) Search in a 3 x 3 area at level N+1
    (search area shown in double lines)
(d) Updated displacement vector


3.2  Computational  Costs 

The  computational  advantages  of  hierarchical  versus  single 
level  search  strategies  can  be  measured  in  two  ways.  We 
can  consider  how  many  points  are  searched  in  arriving  at  a 
final  match  at  the  finest  level.  Each  ancestor  of  the  final 
point  matched  will  contribute  to  this  measure.  On  the 
other  hand,  we  can  measure  the  cost  of  obtaining  matches 
at  all  points  at  the  finest  level.  The  former  measure 
should be used when matching is confined to a relatively
sparse set of interesting points, while the latter is used
when matching is done almost everywhere. Since each of
these cases is interesting, we will look at both.

First, let us consider the comparative cost of arriving at
a single match at the finest level. Let D be the maximum
displacement (measured at the finest level). Then the
initial coarse search will be performed log2 D levels above
the finest level. The number of points searched is
(log2 D + 1) * 9, because the search at each level is
restricted to a 3 x 3 neighborhood and there are
log2 D + 1 levels for search. On the other hand, a
correlation process searching at one level through all points
closer than the maximum displacement uses (2D + 1)^2
points for comparison.

Now consider the cost of computing matches at all points
in the finest level image. Let the finest level image
contain N^2 points and, as above, let D be the maximum
displacement. Single level correlation would then search
N^2 (2D + 1)^2 points. Hierarchical correlation would search
9 * (N^2 + N^2/4 + N^2/16 + ...) points, where the
summation is over all levels at which matching takes place.
However high those levels go, the sum is always less than
that of the corresponding infinite geometric series, viz. 4N^2/3.
Thus the number of points searched hierarchically is less
than 9 * 4N^2/3 = 12N^2.
Table 1 shows a comparison of costs.


   D       S      H     S/H    S/12

   1       9      9     1.0    0.75
   2      25     18     1.4     2.1
   4      81     27     3.0    6.75
   8     289     36     8.0    24.1
  16    1089     45    24.2    90.8
  32    4225     54    78.2    352.

Table 1: Cost of single level vs. hierarchical search.

This table compares hierarchical and single level search
strategies. D is the maximum displacement,
S = (2D + 1)^2 is the cost of single level search,
H = 9(log2 D + 1) is the cost of single match
hierarchical search, and S/H is their relative cost factor.
S/12 is the relative cost factor between single level and
hierarchical full matching (i.e., at all points).
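
The entries of Table 1 follow directly from the formulas in the caption, and the short sketch below recomputes them. The function names are hypothetical, D is assumed to be a power of two, and `hierarchical_full_cost` assumes an N x N finest level with N divisible by D so that every level used has an integer side length.

```python
import math


def single_level_cost(D):
    """S = (2D + 1)^2 trial positions per matched point."""
    return (2 * D + 1) ** 2


def hierarchical_single_match_cost(D):
    """H = 9 (log2 D + 1): a 3 x 3 search at each of log2 D + 1 levels."""
    return 9 * (int(math.log2(D)) + 1)


def hierarchical_full_cost(N, D):
    """Total trial positions when every pixel at every search level is matched."""
    levels = int(math.log2(D)) + 1
    return 9 * sum((N // 2 ** k) ** 2 for k in range(levels))


if __name__ == "__main__":
    print("   D      S    H    S/H    S/12")
    for D in (1, 2, 4, 8, 16, 32):
        S = single_level_cost(D)
        H = hierarchical_single_match_cost(D)
        print(f"{D:4d} {S:6d} {H:4d} {S / H:6.1f} {S / 12:7.2f}")
```

For any N and D the value returned by `hierarchical_full_cost` stays below the 12N^2 bound, since the finite sum is always smaller than the full geometric series.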


3.3 Existence and Uniqueness of Matches

The discussion in section 3.1 shows how the coarse-fine
search strategy automatically ensures that the
correct match must exist within the 3 x 3 local search
window at each level. The filtering and subsampling
processes ensure that the highest frequency at a particular
level corresponds to a wavelength of two pixels at that
level. Since the search is restricted to 3 x 3 windows at
that level, we have some confidence that the match
obtained within this window is unique. Also, the
elimination of lower frequencies at that level helps provide
sufficient variation in the correlation measure within this
window.

Strictly speaking, this argument is valid only for a
one-dimensional version of this process. In the two-dimensional
case there is no high frequency content along straight
edges. This can lead to nearly constant values of the
correlation measure along these edges, thus leading to false
matches. This, in fact, leads us to the use of interest
operators to eliminate points that can potentially lead to
false matches.

4.0 The Problem of False Matches


5.0  Experiments 


The match for a point obtained by the correlation
technique may not always correspond to its environmental
match. There are three basic reasons for this. First,
correlation only deals with translational disparity (in the
image plane) and so should break down with increasing
rotational and scaling components in the disparity.
However, as the optic fundus image experiments show,
small rotations can be dealt with. Secondly, in practical
imaging situations there can be significant amounts of noise
in the images. This most often affects matching at the
finer resolutions. A third cause of false
matches is the occurrence of occlusions in an image. Two
problems can arise here: 1) points in one image may have
no counterpart in the other image; 2) points on an
occlusion boundary have neighboring windows which change
identity from frame to frame.

4.1  Interest  operators 

Interest operators are designed to pick out points for
which matches can be found with a high reliability. This
detection of matchability is the key element of an interest
operator. They can also be used to restrict processing to a
small subset of all the image points to reduce computation
costs. On serial machines this is certainly useful.
However, on image parallel machines such as the processing
cone, we need only be concerned with matchability.

In order that a point in one image be matchable
in another image, the point must be matchable with itself.
For this to be true the local autocorrelation function must
possess a strict local maximum. A sufficient condition for
this is the occurrence of a strong corner at a point.
Kitchen and Rosenfeld [11] present an analysis of various
corner finding algorithms and these algorithms yield very
good interest operators. Moravec [13] gives an interest
operator which attempts to compute the sharpness of an
approximate autocorrelation function directly.

In our hierarchical correlation experiments we have taken
two approaches to applying interest operators. In the first
approach, an interest operator is applied at the finest level
of the first frame to select the points to be matched, and
then a logical pyramid is formed by using "OR" in the
4 x 4 reduction operation. Matching is only performed at
those points which have a value of TRUE in this pyramid.
This method is comparable to Moravec's search strategy
[13].
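
The logical pyramid of the first approach can be sketched as follows. This is a minimal illustration under stated assumptions: the names `or_reduce` and `or_pyramid` are hypothetical, the 4 x 4 footprint is assumed to be centred on the four sons (the same placement one would use for the low-pass reduction), and positions falling outside the image are simply ignored.

```python
def or_reduce(mask):
    """One 4 x 4 'OR' reduction of a square boolean mask with even side."""
    n = len(mask)
    half = n // 2
    out = []
    for i in range(half):
        row = []
        for j in range(half):
            hit = False
            for r in range(2 * i - 1, 2 * i + 3):
                for c in range(2 * j - 1, 2 * j + 3):
                    if 0 <= r < n and 0 <= c < n and mask[r][c]:
                        hit = True
            row.append(hit)
        out.append(row)
    return out


def or_pyramid(finest_mask):
    """Return [finest, ..., 1 x 1]; TRUE marks pixels where matching is done."""
    levels = [finest_mask]
    while len(levels[-1]) > 1:
        levels.append(or_reduce(levels[-1]))
    return levels
```

Because the 4 x 4 footprints of neighbouring fathers overlap, a single interesting pixel at the finest level can switch on more than one pixel at the next coarser level.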

In  the  second  approach,  interest  operators  are  applied  at 
all  levels.  In  this  case  there  can  be  interesting  pixels  with 
un-interesting  fathers.  For  these  pixels,  we  can  do  one  of 
two  things:  1)  the  search  can  be  done  in  a  larger  search 
area,  or  2)  a  displacement  estimate  can  be  obtained  based 
on  neighboring  pixels. 

We are currently studying both approaches but we only
present the first approach in the experiments below. One
of the surprising results of our experiments is that even at
points which appear to be "uninteresting", correct matches
are obtained. Depending on the domain of application, our
experiments show that a large amount of computation to
find interesting points may be unnecessary.


5.1  Mandrill  Image  experiments 

In this experiment we took the standard USC image of a
mandrill and extracted a 128 x 128 subimage of it (Figure 5a).
We created a second image by adding white Gaussian noise
to this image and translating it 5 pixels up and 7 pixels to
the right with respect to the first image (Figure 5b). The
standard deviation of the noise added was 250, which is
10% of the intensity range of the image.
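
A test pair of this kind is easy to synthesise. The sketch below is only an illustration of the construction, not the procedure used for Figure 5: the function name `make_second_frame` is hypothetical, the shift wraps around at the image border (the experiment instead cut both frames from a larger image), and the noise level is left as a parameter.

```python
import random


def make_second_frame(first, up=5, right=7, sigma=0.0, seed=0):
    """Shift `first` by `up` rows upward and `right` columns rightward,
    then add zero-mean Gaussian noise with standard deviation `sigma`.

    The content at (r, c) in the first frame appears at (r - up, c + right)
    in the result, i.e. the true (row, column) displacement is (-up, right).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(first), len(first[0])
    second = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            noisy = first[r][c] + rng.gauss(0.0, sigma)
            second[(r - up) % n_rows][(c + right) % n_cols] = noisy
    return second
```

With the default arguments the true displacement is (-5, 7) in (row, column) terms, matching the sign convention used for the histograms in this section.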

We conducted two experiments with these images. The
first was the hierarchical matching process. The Laplacian
pyramids were constructed using Burt's techniques. The
matching at each level was done using an 8 x 8 sample
window at all points in the image. The results at the
various levels are shown in Figure 6 and Figure 7.
Figure 6 shows results at levels 4, 5, 6, and 7. At each level
the displacement estimates are shown at a sampling of 64
points.


Figure 5: Mandrill eye images used in the first experiment.

(a) A 128 x 128 piece of the larger mandrill image.

(b) A similar piece, translated 5 pixels up and 7 to the
right, with white Gaussian noise added (standard
deviation = 10% of full range).


In Figure 7 two-dimensional histograms of the row versus
the column components of the displacements are shown for
each of levels 4 through 7 (Figure 7, a through d). Note,
in Figure 7d, the high count found in the bucket
corresponding to the correct displacement of (-5,7). The
histogram for level 7 (Figure 7d) indicates that about 87%
of the displacement values are exact. This shows that the
hierarchical process is quite insensitive to noise.
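
A displacement histogram of this kind, and the fraction of exact matches read from it, can be computed as in the sketch below. The names `displacement_histogram` and `fraction_exact` are hypothetical, and the displacement field is assumed to be stored as a list of rows of integer (row, column) pairs.

```python
from collections import Counter


def displacement_histogram(field):
    """Count how often each (row, column) displacement occurs in the field."""
    return Counter((dr, dc) for row in field for (dr, dc) in row)


def fraction_exact(field, true_disp):
    """Fraction of pixels whose displacement equals the known true value."""
    hist = displacement_histogram(field)
    total = sum(hist.values())
    return hist[true_disp] / total
```

For example, a 2 x 2 field in which three of the four vectors equal (-5, 7) gives a histogram peak at (-5, 7) and an exact fraction of 0.75.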

In the second experiment, we attempted to match these
two images using a correlation process all at one level. In
doing this we used 8 x 8 sample windows and searched in
a 17 x 17 search area around each pixel (the actual
displacements of -5,7 will fall within this range). The
results of this process are shown in Figure 8. Note the
greatly reduced accuracy of this method (53% correct).

Figure 7: Distribution of displacement vectors.

The histograms of the row and column components of
the displacements obtained in the Mandrill experiment.
Levels 4 through 7 are shown. Note the peak at
(-5,7) in Figure 7d.

Figure 8: Single level correlation.

Results  of  the  variance  normalized  correlation  applied 
to  the  Mandrill  images  (Figure  5). 

(a)  shows  the  displacement  vectors  and  (b)  shows  the 
displacement  histograms. 


5.2 Optic fundus image experiments


The next problem to which we applied the hierarchical
matching algorithm was that of registering two fluorescein
angiogram images of the optic fundus (see Figure 9).
These images were obtained from Paul Nagin at the Tufts
New England Medical Center and are digitized as 128 x 128
images. The problem is to register two images taken at
the beginning and at the peak of dye filling. Areas which
show very little change are recognized as regions where no
filling of the dye is taking place. This measurement can
then be used in the prognosis of glaucoma. Due to severe
contrast changes over this time interval it is necessary to
register a temporal sequence of 8 to 10 images.

In this experiment the finest level image was bandpass
filtered using a ∇²G convolution. The Gaussian
convolution used has a standard deviation of two pixels and
introduces some smoothing at the finest level. In fact, we
implement this convolution using the Fast Fourier Transform
with the filtering done in the frequency domain. A
pyramid is formed using the ∇²G filtered image as the
base. In Figure 10 we show the results of applying the
matching algorithm at the bottom four levels. Again, we
have subsampled the vector field for display purposes.


Figure 9: Optic fundus test images.

These are real images taken at two successive time
instants. Note the large change in mean intensity.

For the images used in this experiment any
three-dimensional effects due to the movements of the eye can
be safely ignored, so the misregistration is due to a rigid
motion in the plane (viz., eye movements and
mis-alignments in the digitization process). The problem
then is to find the rigid motion which will bring the
images into register.

Figure 10: Computed displacement vectors.

The displacement vectors at levels 4 through 7 obtained
in the optic fundus experiment. Note that the
displacement fields have a distinct rotational
component.


To measure the accuracy of the matching algorithm we
generated a vector field by computing the rigid
transformation that best fit the computed displacement field.
This field was subtracted from the displacement computed
by the matching algorithm. A two-dimensional histogram
of the row versus the column components of the difference
vectors is shown in Figure 11. Figure 12 shows the same
type of histogram with the differences being taken only at
a set of interesting points. In this case 76 interesting points
were computed at the finest level using the
Kitchen-Rosenfeld corner finder [11] and then "OR-ed"
up the pyramid.
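
One way to obtain such a best-fitting rigid motion is a least-squares fit of a rotation and a translation to the displaced points; the text does not say which fitting method was actually used, so the closed-form 2-D solution below is an assumption, and the names `fit_rigid` and `residuals` are hypothetical. Points are treated as generic (x, y) pairs; the same code applies to (row, column) coordinates as long as one convention is used throughout.

```python
import math


def fit_rigid(points, displacements):
    """Least-squares rotation angle and translation mapping each point p
    to p + d. Returns (theta, tx, ty)."""
    n = len(points)
    targets = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, displacements)]
    px = sum(p[0] for p in points) / n
    py = sum(p[1] for p in points) / n
    qx = sum(q[0] for q in targets) / n
    qy = sum(q[1] for q in targets) / n
    s_dot = s_cross = 0.0
    for (x, y), (u, v) in zip(points, targets):
        ax, ay = x - px, y - py
        bx, by = u - qx, v - qy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = qx - (c * px - s * py)
    ty = qy - (s * px + c * py)
    return theta, tx, ty


def residuals(points, displacements, theta, tx, ty):
    """Computed displacement minus the displacement predicted by the fit."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for (x, y), (dx, dy) in zip(points, displacements):
        model_dx = c * x - s * y + tx - x
        model_dy = s * x + c * y + ty - y
        out.append((dx - model_dx, dy - model_dy))
    return out
```

Binning the residual vectors into a two-dimensional histogram then gives an error distribution of the same kind as the one discussed in this section.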

The  central  bucket  of  the  histograms  corresponds  to  an 
error  of  at  most  1/2  pixel  in  either  of  the  row  or  column 
directions. About 35 percent of the points in the histogram
of Figure 11 are in the central bucket whereas in
Figure 12, approximately 55 percent of the points are in the
central  bucket.  This  indicates  that  better  accuracy  can  be 
obtained  using  an  interest  operator.  However,  the 
concentration  of  points  around  the  central  bucket  in 
Figure  11  suggests  that  global  statistics  of  the  displacement 
field  can  be  accurately  obtained  without  the  use  of  an 
interest  operator. 


6.0  Future  Directions 

Our  experiments  have  shown  that  hierarchical  matching 
provides  an  excellent  method  for  the  computation  of 
displacement  fields.  However,  a  thorough  evaluation  of  the 
effects  of  various  parameters  and  algorithmic  options  on  the 
accuracy  of  the  computed  fields  has  yet  to  be  carried  out. 
These  include: 

1. The size of the sample window: Can windows as small
as 3 x 3 provide adequate matches?

2. The shape of the sample window: How do
center-weighted windows (e.g., Gaussian windows [6])
improve accuracy when the disparity field is not
smoothly varying?

3. The method of computing the bandpass pyramids: This
issue has two parts: a) the effect of the method on the
accuracy of the displacement field; and b) efficient
computational implementations.

4.  The  use  of  normalized  correlation;  Bandpass  filtering 
and  the  small  3x3  search  areas  seem  to  eliminate  the 
need  to  do  normalized  correlation.  This  may  not 
remain  true  if  sample  windows  smaller  than  8  x  8  are 
used. 
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Figure  11:  Distribution  of  difference  vectors. 

The  error  histogram  obtained  by  differencing  the 
computed  displacement  field  from  the  field  generated 
by  the  translational  and  rotational  paramters  derived 
from  the  computed  field. 
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Figure  12:  Distribution  it  Interesting  points. 

Using  the  same  data  as  in  previous  figure,  this 
histogram  was  obtained  by  including  only  the 
displacement  vectors  at  interesting  points. 
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Wh.le  the  absolute  accuracy  (i.e,  percent  correct)  of  the 
displacement  field  is  important,  the  degree  and  nature  of 
tolerance  of  errors  greatly  depend  on  its  intended 
application  For  example,  in  image  registration  applications, 
the  statistical  distribution  of  errors  is  more  rignificant  than 
errors  at  specific  points.  On  the  other  hand,  some  of  the 
structure  from  motion  algorithms  require  highly  accurate 
displacements  at  a  few  speicific  points.  Hence,  a  study  of 
the  evaluation  criteria  of  the  displacement  fields  with 
careful  consideration  of  the  application  domains  is  an 
important  area  of  future  work 

Another important research problem is a systematic study
of interest operators and sharpness measures. It is
necessary to understand how they relate to the degree of
confidence in the displacements obtained at image points.
This issue is also linked to the issue of how to obtain a
dense, accurate displacement field from a sparse field (i.e.,
one computed only at points of high confidence) or from a
dense but inaccurate field (with the knowledge of the
degree of confidence of the displacements). Typically, these
involve applying smoothing or interpolation processes to the
displacement fields. It is important to note here that there
are hierarchical techniques [7] that dramatically improve the
speed of some of the iterative smoothing processes. In fact,
it appears that hierarchical matching and interpolation can
be performed together.

7.0  Summary 

In this paper we have described the implementation of a
hierarchical correlation process in the processing cone
architecture of the VISIONS Image Operating System. Two
representative experiments were presented to describe its
performance. The hierarchical process involves matching
band-pass filtered images at different levels of resolution
under the control of a coarse-to-fine search strategy. We
described three different techniques for computing the
band-pass image pyramid. We also discussed the issues
involved in the correlation and search processes.

We have shown how this hierarchical correlation
technique both reduces the costs of correlation matching
and avoids the mismatch problem that occurs in areas of
high feature density. The results of our experiments
indicate that it is also insensitive to noise and is able to
detect at least small amounts of rotation between images.

Abstract

A formulation of shape from shading is presented in which
surface orientation is related to image irradiance without
requiring detailed knowledge of either the scene illumination or
the albedo of the surface material. The case for uniformly
diffuse reflection and perspective projection is discussed in detail.
Experiments aimed at using the formulation to recover surface
orientation are presented and the difficulty of nonlocal
computation discussed. We present an algorithm for reconstructing the
3-D surface shape once surface orientations are known.


1  INTRODUCTION 

When the human visual system processes a single image,
e.g., Figure 1, it returns a perceived 3-D model of the world, even
when that image has limited contour and texture information.
This 3-D model is underdetermined by the information in the
2-D image; the visual system has used the image data and its
model of visual processing to reconstruct the 3-D world. While
there are many information sources within the image, shading is
an important source. Facial make-up, or a cartoonist's shading,
is an everyday example of the way shape, as perceived by our
human visual system, is manipulated by shading information.

A primary goal of computer vision is to understand this
process of reconstructing the 3-D world from 2-D image data,
to discover the model, or models, that allow 3-D structure to be
inferred from 2-D data. The focus of this work is the recovery of
the 3-D orientation of surfaces from image shading.

We present a formulation of the shape-from-shading problem,
i.e., recovering 3-D surface shape from image shading,
that is derived under assumptions of perspective projection,
uniformly diffuse reflection,1 and constant reflectance. This
formulation differs from previous approaches to the problem in that
we neither make assumptions about the surface shape [2], nor
use direct knowledge of the illumination conditions and the
surface albedo [3]. The cost we incur for dispensing with these
restrictions is the introduction of higher-order differentials into
the equations relating surface orientation and image irradiance.
The benefits we gain allow us to investigate the strength of the
constraint imposed by shading upon shape. Past attempts to
solve the shape-from-shading problem, as well as our own efforts,
have been aimed at recovering surface shape from image patches
for which the reflectance (albedo) can be considered constant.

The research reported herein was supported by the Defense Advanced
Research Projects Agency under Contract MDA903-83-C-0027 and by the
National Aeronautics and Space Administration under Contract NASA
tM666*l. These contracts are monitored by the U.S. Army Engineer
Topographic Laboratory and by the Texas A&M Research Foundation for
the Lyndon B. Johnson Space Center.

1 We prefer the expression isotropic scattering to either uniformly
diffuse reflection or Lambertian reflection, as it emphasizes that scene
radiance is isotropic. However, uniformly diffuse reflection and Lambertian
reflection are the terms commonly used to indicate that the scene radiance
is isotropic.


Figure 1 Shape from Shading.

Previously we examined the influence exerted by the
assumption of uniformly diffuse reflection [1], and indicated that
the equations relating surface orientation to image irradiance
could be expected to yield useful results even in cases in which
the reflection is not uniformly diffuse. In that examination we
assumed orthographic rather than perspective projection. A
comparison of our previous work with this paper, however, shows
that the structure of the formulation is not dependent upon the
projection used.

If we add additional assumptions, e.g., constraints on the
surface type, we can simplify the relationship between surface
orientation and image irradiance. While it is not our goal to add
constraints upon surface type, the assumption that the surface
is locally spherical allows the approximate surface orientation to
be recovered by local computation.


Figure 2 Coordinate Frame. X,Y,Z are the scene coordinates,
U,V the image coordinates, and the image plane is located a
distance f from the scene coordinates' origin, the projection center.
α is the angle between the Z axis (the viewing direction) and the ray
of light from the scene point (x, y, z) to the image point (u, v). l and
m are the X and Y components of the surface normal n.

2  THE  COORDINATE  FRAME  AND 
REPRESENTATION  OF  SURFACE 
ORIENTATION 

The coordinate system we use is depicted in Figure 2. X,Y,Z
are the scene coordinates and U,V are the image coordinates.
The image and scene coordinates are aligned so that X and U
axes are parallel, as are the Y and V axes. The U and V axes are
inverted with respect to the X and Y axes, so that positive X and
Y coordinates will correspond to positive U and V coordinates.
The image plane is located at a distance f from the (perspective)
projection center, the origin of the scene coordinates. A ray of
light from the point (x, y, z) in the scene to the image point (u, v)
makes an angle α with the viewing direction (i.e., the Z axis).

There are many parameterizations of the surface orientation;
we choose to use (l, m), which are the X and Y components
of the unit surface normal. In Figure 2, n is the unit normal
of the surface patch located at (x, y, z); l and m are the
components of this surface normal in the X and Y directions. From
our viewing position we can see at most half the surfaces in the
scene (i.e., those that face the viewer). The Z component of the
surface normal has the magnitude $\sqrt{1 - l^2 - m^2}$, the sign
determining whether the surface is forward-facing (has a positive Z
component), or backward-facing (has a negative Z component).
For large off-axis angle α, we see backward-facing surfaces near
the edges of objects. The two components of the surface normal,
l and m, do not provide an adequate parameterization of the
surface in this case. Additionally, we need to know the sign of
the Z component. Here we restrict ourselves to forward-facing
surfaces. This minor restriction amounts to assuming that α is
not too large and that we are not adjacent to an object's edge.
Consequently, in this discussion we assume that the Z component
of the surface normal is positive and that l and m constitute an
adequate parameterization of scene surfaces.
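The (l, m) parameterization for forward-facing surfaces can be illustrated with a short sketch. The function name `unit_normal` is hypothetical; the only fact used is that the Z component has magnitude sqrt(1 - l^2 - m^2) and is taken as positive under the restriction stated above.

```python
import math


def unit_normal(l, m):
    """Recover the full unit normal (l, m, n_z) of a forward-facing
    surface patch from its X and Y components."""
    s = 1.0 - l * l - m * m
    if s < 0.0:
        raise ValueError("l and m cannot be components of a unit vector")
    # Forward-facing restriction: the Z component is taken as positive.
    return (l, m, math.sqrt(s))


normal = unit_normal(0.6, 0.0)
length = math.sqrt(sum(c * c for c in normal))
```

For a backward-facing patch the same (l, m) would pair with the negative root, which is why the two components alone are not a sufficient description in the general case.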


3  IMAGE  IRRADIANCE 

The image irradiance equation we use is [4]

$$I(u, v) = R(l, m) \cos^4 \alpha ,$$

where I(u, v) is the image irradiance as a function of the image
coordinates u and v, and R(l, m) is the surface radiance as a
function of l and m, the components of the surface normal.2 The
term $\cos^4 \alpha$ represents the off-axis effect of perspective
projection. When α is small, $\cos^4 \alpha$ is approximately unity,
and we then have the more familiar form of the image irradiance equation.
From Figure 2 we see that

$$\cos \alpha = \frac{f}{\sqrt{u^2 + v^2 + f^2}} .$$
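A small numerical sketch of the off-axis term may help. It assumes the reading cos α = f / sqrt(u^2 + v^2 + f^2) of the expression above; the function names are illustrative only.

```python
import math


def cos_alpha(u, v, f):
    """Cosine of the off-axis angle for image point (u, v), with the
    image plane at distance f from the projection center."""
    return f / math.sqrt(u * u + v * v + f * f)


def image_irradiance(radiance, u, v, f):
    """Image irradiance I(u, v) = R * cos^4(alpha) for a given
    surface radiance value R."""
    return radiance * cos_alpha(u, v, f) ** 4


on_axis = image_irradiance(1.0, 0.0, 0.0, 50.0)
off_axis = image_irradiance(1.0, 3.0, 4.0, 12.0)
```

On the optical axis the factor is exactly one, and it falls off as the image point moves away from the axis, which is the effect the orthographic form of the equation ignores.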

Differentiating the image irradiance equation with respect
to the image coordinates u and v, we obtain

$$I'_u = R_l l_u + R_m m_u ,$$

$$I'_v = R_l l_v + R_m m_v ,$$

$$I'_{uu} = R_{ll} l_u^2 + R_{mm} m_u^2 + 2 R_{lm} l_u m_u + R_l l_{uu} + R_m m_{uu} ,$$

$$I'_{vv} = R_{ll} l_v^2 + R_{mm} m_v^2 + 2 R_{lm} l_v m_v + R_l l_{vv} + R_m m_{vv} ,$$

$$I'_{uv} = R_{ll} l_u l_v + R_{mm} m_u m_v + R_{lm} (l_u m_v + l_v m_u) + R_l l_{uv} + R_m m_{uv} ,$$

where subscripted variables denote partial differentiation with
respect to the subscript(s), and the primed quantities are the
corresponding partial derivatives of $I / \cos^4 \alpha$.

2 Image irradiance is the light flux per unit area falling on the image, i.e.,
incident flux density. Scene radiance is the light flux per unit projected area
per unit solid angle emitted from the scene, i.e., emitted flux density per
unit solid angle.

If we are to use these expressions to relate image measurements,
e.g., $I'_{uu}$, to surface parameters l and m, then we must
remove the derivatives of R.


4  UNIFORMLY  DIFFUSE  REFLECTION 

To provide the additional constraints we need for relating
surface orientation to image irradiance, we introduce constraints
that relate properties of R(l, m), that is, constraints that
specify the relationship between surface radiance and surface
orientation. Such constraints are

$$(1 - l^2) R_{ll} = (1 - m^2) R_{mm} ,$$

$$(R_{ll} - R_{mm})\, l m = (l^2 - m^2) R_{lm} ,$$

where $R_{ll}$ is the second partial derivative of R with respect to
l, $R_{mm}$ is the second partial derivative of R with respect to m,
and $R_{lm}$ is the second partial cross-derivative of R with respect
to l and m.

These two partial differential equations embody the
assumption of uniformly diffuse reflection. For uniformly diffuse
reflection, R(l, m) has the form

$$R(l, m) = a l + b m + c \sqrt{1 - l^2 - m^2} + d ,$$

where a, b, c, and d are constants, their values depending on
illumination conditions and surface albedo. Note that l, m, and
$\sqrt{1 - l^2 - m^2}$ are the components of the unit surface normal in
the directions X, Y, and Z. R(l, m) can be viewed as the dot
product of the surface normal vector $(l, m, \sqrt{1 - l^2 - m^2})$ and a
vector (a, b, c) denoting illumination conditions. As the value of a
dot product is independent of rotations of the coordinate system,
the scene radiance is independent of the viewing direction,
which is the definition of uniformly diffuse reflection.

It is evident that $R(l, m) = a l + b m + c \sqrt{1 - l^2 - m^2} + d$
satisfies the pair of partial differential equations given above.
In [1] we showed that this form of R(l, m) is the solution of the
pair of partial differential equations. These partial differential
equations are an alternative definition of uniformly diffuse reflection.

It is worthy of note that $R(l, m) = a l + b m + c \sqrt{1 - l^2 - m^2} + d$
includes radiance functions for multiple and extended illumination
sources, including that for a hemispherical uniform source
such as the sky. Of course, at a self-shadow edge R is not
differentiable, so that the surfaces on each side of the self-shadow
boundary have to be treated separately. The assumption of
uniformly diffuse reflection restricts the class of material surfaces
being considered, not the illumination conditions.

From the constraints for uniformly diffuse reflection, we
derive the relationships

$$R_{ll} = \frac{1 - m^2}{l m} R_{lm} , \qquad R_{mm} = \frac{1 - l^2}{l m} R_{lm} .$$

Substituting these relationships for $R_{ll}$ and $R_{mm}$ in the
expressions for $I'_{uu}$, $I'_{vv}$, and $I'_{uv}$, we obtain

$$\left[ l_u^2 \frac{1 - m^2}{l m} + m_u^2 \frac{1 - l^2}{l m} + 2 l_u m_u \right] R_{lm} = I'_{uu} - R_l l_{uu} - R_m m_{uu} ,$$

$$\left[ l_v^2 \frac{1 - m^2}{l m} + m_v^2 \frac{1 - l^2}{l m} + 2 l_v m_v \right] R_{lm} = I'_{vv} - R_l l_{vv} - R_m m_{vv} ,$$

$$\left[ l_u l_v \frac{1 - m^2}{l m} + m_u m_v \frac{1 - l^2}{l m} + l_u m_v + l_v m_u \right] R_{lm} = I'_{uv} - R_l l_{uv} - R_m m_{uv} .$$

By removing $R_{lm}$ and substituting the expressions for $R_l$
and $R_m$ defined by the expressions for $I'_u$ and $I'_v$, we produce
two partial differential equations relating surface orientation to
image irradiance:

$$\alpha\theta\, l_{uu} + \beta\theta\, m_{uu} - \alpha\gamma\, l_{uv} - \beta\gamma\, m_{uv} = \chi\theta\, I'_{uu} - \chi\gamma\, I'_{uv} ,$$

$$\alpha\theta\, l_{vv} + \beta\theta\, m_{vv} - \alpha\delta\, l_{uv} - \beta\delta\, m_{uv} = \chi\theta\, I'_{vv} - \chi\delta\, I'_{uv} ,$$

where

$$\alpha = I'_u m_v - I'_v m_u ,$$

$$\beta = I'_v l_u - I'_u l_v ,$$

$$\gamma = l_u^2 (1 - m^2) + m_u^2 (1 - l^2) + 2 l_u m_u l m ,$$

$$\delta = l_v^2 (1 - m^2) + m_v^2 (1 - l^2) + 2 l_v m_v l m ,$$

$$\theta = l_u l_v (1 - m^2) + m_u m_v (1 - l^2) + (l_u m_v + l_v m_u) l m ,$$

$$\chi = l_u m_v - l_v m_u .$$

These equations relate surface orientation to image
irradiance by parameter-free expressions. We make no
assumptions about surface shape, nor do we need to know the
parameters specifying illuminant direction, illuminant strength,
and surface albedo. Our assumptions are about the properties
of reflection in the world; these alone are sufficient to relate
surface orientation to image irradiance. The above equations
have been derived for the case of perspective projection; for
orthographic projection, the primed (') quantities are replaced by
their unprimed counterparts, e.g., $I'_u$ is replaced by $I_u$. The
form of the equations is not a function of the projection used.
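The claim in this section that the radiance function for uniformly diffuse reflection, R(l, m) = al + bm + c sqrt(1 - l^2 - m^2) + d, satisfies the two constraints (1 - l^2) R_ll = (1 - m^2) R_mm and (R_ll - R_mm) l m = (l^2 - m^2) R_lm can be checked numerically. The sketch below uses central differences; the constants a, b, c, d and the step size are arbitrary choices made for illustration.

```python
import math


def radiance(l, m, a=0.3, b=-0.2, c=0.9, d=0.1):
    """R(l, m) for uniformly diffuse reflection; a, b, c, d are arbitrary."""
    return a * l + b * m + c * math.sqrt(1.0 - l * l - m * m) + d


def second_derivatives(R, l, m, h=1e-4):
    """Central-difference estimates of R_ll, R_mm and R_lm."""
    R_ll = (R(l + h, m) - 2.0 * R(l, m) + R(l - h, m)) / (h * h)
    R_mm = (R(l, m + h) - 2.0 * R(l, m) + R(l, m - h)) / (h * h)
    R_lm = (R(l + h, m + h) - R(l + h, m - h)
            - R(l - h, m + h) + R(l - h, m - h)) / (4.0 * h * h)
    return R_ll, R_mm, R_lm


def constraint_residuals(l, m):
    """Residuals of the two constraint equations at orientation (l, m)."""
    R_ll, R_mm, R_lm = second_derivatives(radiance, l, m)
    first = (1.0 - l * l) * R_ll - (1.0 - m * m) * R_mm
    second = (R_ll - R_mm) * l * m - (l * l - m * m) * R_lm
    return first, second


residuals = constraint_residuals(0.3, 0.4)
```

Both residuals vanish up to discretization error regardless of the values chosen for a, b, c, and d, which is the sense in which the constraints are free of illumination and albedo parameters.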

5  RECOVERY  OF  SURFACE 
ORIENTATION 

It is difficult to solve the equations relating surface
orientation to image irradiance, and thus to recover surface shape from
observed image irradiance. We have used numerous integration
schemes that characterize two distinct approaches. The two
differential equations can be directly integrated in a step-by-step
manner or, given some initial solution, a relaxation procedure
may be employed. The difficulties that arise are twofold:
numerical errors and multiple solutions.

Solutions of the equation $\chi = 0$ (the developable surfaces,
e.g., a cylinder) are also solutions of the equations relating
surface orientation to image irradiance. If the image intensities
were known in analytic form, the analytic approach to solving
the equations could then employ boundary conditions to select
the appropriate solution. However, since the analytic form for
the image intensities is unknown, numerical procedures must
be employed. The use of such procedures to directly integrate
the equations inevitably introduces small errors. Such errors
"mix in" multiple solutions even when those solutions are
incompatible with the boundary conditions. Instability of the
numerical scheme seems responsible for the fact that such errors
eventually dominate the recovered solution. A scheme that is
representative of our various trials at direct integration is outlined.

We transform our equations into finite-difference equations
by using a three-point formula for the differentials of l and m. If
l(i, j) and m(i, j) are the values of l and m at the (i, j)th pixel in
the image, then at this pixel we use the finite-difference formulas,

$$l_u = \frac{l(i+1, j) - l(i-1, j)}{2} ,$$

$$l_{uu} = l(i+1, j) + l(i-1, j) - 2\, l(i, j) ,$$

$$l_{uv} = \frac{l(i+1, j+1) + l(i-1, j-1)}{4} - \frac{l(i+1, j-1) + l(i-1, j+1)}{4} ,$$

and similar formulas for the other differentials. If we consider
the 3 x 3 image patch centered on the (i, j)th pixel,

[3 x 3 patch diagram: the (&) cell at (i+1, j+1), the remaining cells marked (o)]

we could hope that the two finite difference equations, relating
the eighteen values of l and m on the patch, could be solved
explicitly for l(i+1, j+1) and m(i+1, j+1) (the (&) cell).
Such a solution would allow l and m at the (&) cell to be
calculated from the l's and m's at the (o) cells. Starting at some
boundary at which we know l and m at the (o) cells, we can
move along the image's row and then along the successive rows,
calculating l and m at the (&) cell. However, examination of the
surface-orientation-to-image-irradiance equations shows that we
cannot solve these equations explicitly for $l_{uv}$ and $m_{uv}$ and that,
consequently, we cannot obtain finite-difference equations that
are explicit in the l and m of the (&) cell.
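The three-point finite-difference formulas given earlier in this section can be sketched as follows. The grid is assumed to be a list of rows indexed as `l[i][j]`, with i running along u and j along v; that indexing and the function name are assumptions made for illustration.

```python
def finite_differences(l, i, j):
    """Three-point estimates of l_u, l_uu and l_uv at pixel (i, j).

    Unit pixel spacing is assumed; i indexes u and j indexes v.
    """
    l_u = (l[i + 1][j] - l[i - 1][j]) / 2.0
    l_uu = l[i + 1][j] + l[i - 1][j] - 2.0 * l[i][j]
    l_uv = (l[i + 1][j + 1] + l[i - 1][j - 1]
            - l[i + 1][j - 1] - l[i - 1][j + 1]) / 4.0
    return l_u, l_uu, l_uv


# Test surface l(i, j) = i^2 + 3ij, whose exact derivatives at (1, 1)
# are l_u = 2i + 3j = 5, l_uu = 2 and l_uv = 3.
grid = [[i * i + 3 * i * j for j in range(3)] for i in range(3)]
l_u, l_uu, l_uv = finite_differences(grid, 1, 1)
```

Note that the mixed difference draws on the four diagonal neighbors of the patch, which is why the corner cell at (i+1, j+1) enters the equations only through the mixed-derivative terms.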

We avoid this difficulty by combining the two
surface-orientation-to-image-irradiance equations into one and using
surface continuity to provide the additional equation. Removing $l_{uv}$
and $m_{uv}$ from the differential equations, we have

$$\alpha (\delta\, l_{uu} - \gamma\, l_{vv}) + \beta (\delta\, m_{uu} - \gamma\, m_{vv}) = \chi (\delta\, I'_{uu} - \gamma\, I'_{vv}) .$$

Surface continuity requires that
$\partial^2 z / \partial x \partial y = \partial^2 z / \partial y \partial x$,
from which it follows that

$$l_y (1 - m^2) + m_y l m = m_x (1 - l^2) + l_x l m .$$

Provided that u and v are small compared with z (e.g., in the
eye or in a standard-format camera), then

$$l_v (1 - m^2) + m_v l m = m_u (1 - l^2) + l_u l m .$$

These two equations, which do not involve $l_{uv}$ or $m_{uv}$, form a
basis for finite difference equations that calculate l and m at the
(-) cell from values of l and m at (+) cells.
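The surface continuity condition in image coordinates, l_v(1 - m^2) + m_v lm = m_u(1 - l^2) + l_u lm, can be written as a residual that should vanish for any orientation field belonging to a continuous surface. The following is a minimal sketch; the function name and argument order are illustrative.

```python
def continuity_residual(l, m, l_u, l_v, m_u, m_v):
    """Left side minus right side of the surface continuity equation.

    Zero when the orientation derivatives are consistent with a
    continuous (integrable) surface.
    """
    left = l_v * (1.0 - m * m) + m_v * l * m
    right = m_u * (1.0 - l * l) + l_u * l * m
    return left - right


# Sphere-like field: l_v = m_u = 0 and l_u = m_v, so the residual is zero.
consistent = continuity_residual(0.3, 0.4, 0.1, 0.0, 0.0, 0.1)

# An inconsistent field: only l_v is nonzero.
inconsistent = continuity_residual(0.0, 0.0, 0.0, 0.2, 0.0, 0.0)
```

In a relaxation setting this residual is the kind of quantity that would serve as the continuity error term at each pixel.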


The results obtained with the above integration scheme,
together with many variations of it, are poor. Accurate values
for l and m are obtained only within approximately five to ten
rows of the known boundary. This is the case for noise-free
image data. These results can be understood by examination
of the finite-difference equations. The explicit expressions for
l and m at the (-) cell are functions of the differences of l
and m at the (+) cells. Such schemes are usually numerically
unstable, making step-by-step integration impossible. While
the failure to find a stable numerical scheme does not imply
that one does not exist, our difficulty highlights the problem
of finding numerical schemes, based on differential models, to
propagate information from known boundaries. (One wonders
whether nature experienced the same difficulties when designing
the human vision system.)

Although the alternative to direct integration, a relaxation
procedure to solve the equations, seems to offer relief from the
numerical instability of direct integration, it nevertheless poses
its own problems. The approach we used parallels the one in
[3] for solving the image irradiance equation when the surface
albedo and illumination conditions are known. For each image
pixel we form three error terms: the residuals associated with
the two surface-orientation-to-image-irradiance equations, and
with the one surface continuity equation. Minimizing the sum
of the errors over the whole image with respect to l and m at
each pixel produces an updating rule for l and m at each pixel.
Given an initial solution, i.e., assignment of values for l and m
at each pixel, a relaxation scheme, like the one described, is useful
only if it converges. While the constraint imposed by the
underlying model is most important in ensuring convergence, the
importance of a good initial solution for a relaxation method
cannot be overemphasized. Simplifying the two partial differential
equations (by using additional assumptions) provides a method
for obtaining a good initial solution.

The spherical approximation assumes that we are viewing
a spherical surface. This implies $l_y = 0$, $m_x = 0$, and $l_x = m_y$,
namely, constant curvature that is independent of direction.
Provided that u and v are small compared with z, then $l_v = 0$,
$m_u = 0$, and $l_u = m_v$. For this case, the partial differential
equations become relationships between image irradiance and its
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derivatives,  on  the  one  hand,  and  the  components  of  the  surface 
normal,  on  the  other: 


$$\frac{1 - m^2}{l m} = \frac{I'_{uu}}{I'_{uv}} , \qquad \frac{1 - l^2}{l m} = \frac{I'_{vv}}{I'_{uv}} .$$

The spherical-approximation results for perspective
projection are similar to those Pentland was able to obtain [2] for
orthographic projection through local analysis of the surface.
Besides providing a mechanism for obtaining an initial solution
for a relaxation-style algorithm, they allow surface orientation
to be estimated by purely local computation. Such an estimate
will be exact when the surface is locally spherical.

The results of our experiments with relaxation procedures
are easily summarized: the relaxation procedures were not
convergent. While such nonconvergence is hardly unusual, the
reasons for failure are instructive. The residuals
associated with both the surface-orientation-to-image-irradiance
equations and the surface continuity equations remain small
during the relaxation, even when the solution is starting to
diverge. Of course the residuals are not as small as they are
when on the verge of solution, but they are small enough to
make one believe that a solution has been obtained, particularly
when the image is not noise-free. Apparently the equations are
insensitive to particular values of l and m, being more concerned
with the values of their derivatives, such as $m_u$ and $m_v$. As with
direct integration, relaxation models need boundary conditions to
select a particular solution. We used various boundary conditions
in our relaxation experiments, but it is difficult to believe that a
model, apparently insensitive to surface orientations, could be
overly influenced by the surface orientations at a boundary.

Our  two  approaches,  direct  integration  and  relaxation,  have 
not  yielded  a  computational  solution  to  the  problem  of  recover¬ 
ing  surface  orientation  from  shading.  The  attractiveness  of  lo¬ 
cal  computation  is  clear;  it  has  neither  numerical  instability  nor 
divergent  behavior,  but  the  cost  it  imposes  is  that  assumptions 
must  be  made  about  surface  shape.  A  compromise  between 
some  local  computation  and  some  information  propagation  may 
offer an approach that is not overly restrictive in its assumptions
about surface shape. However, these questions need to be
considered: Is the model underconstrained? Is shape recovery
dependent  on  information  other  than  shading?  What  other  in¬ 
formation  (that  is  obtainable  from  the  image),  is  necessary  to 
enable  the  construction  of  effective  shape-recovery  algorithms? 


6  RECONSTRUCTION  OF  THE  SURFACE 
SHAPE 

Surface  orientation  is  not  the  same  as  surface  shape. 
However,  once  we  have  obtained  the  surface  orientation  as  a 
function  of  image  coordinates,  i.e.,  l(u,  v)  and  m(u,  v),  we  can  use 
these to reconstruct the surface shape in the scene coordinates
X, Y, Z. We derive a suitable formula.


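Suppose we know the depth $z_0$ at scene coordinates
$(x_0, y_0, z_0)$, corresponding to $(u_0, v_0)$ in the image. For the point
$(x_0 + \Delta x, y_0 + \Delta y)$ we use the approximation

$$z(x_0 + \Delta x, y_0 + \Delta y) = z(x_0, y_0) + \Delta x \left.\frac{\partial z}{\partial x}\right|_{x_0, y_0} + \Delta y \left.\frac{\partial z}{\partial y}\right|_{x_0, y_0} .$$

Similarly,

$$z(x_1 - \Delta x, y_1 - \Delta y) = z(x_1, y_1) - \Delta x \left.\frac{\partial z}{\partial x}\right|_{x_1, y_1} - \Delta y \left.\frac{\partial z}{\partial y}\right|_{x_1, y_1} .$$

If $x_1 = x_0 + \Delta x$ and $y_1 = y_0 + \Delta y$, then

$$z(x_1, y_1) = z(x_0, y_0) + \frac{x_1 - x_0}{2} \left( \left.\frac{\partial z}{\partial x}\right|_{x_0, y_0} + \left.\frac{\partial z}{\partial x}\right|_{x_1, y_1} \right) + \frac{y_1 - y_0}{2} \left( \left.\frac{\partial z}{\partial y}\right|_{x_0, y_0} + \left.\frac{\partial z}{\partial y}\right|_{x_1, y_1} \right) .$$

Using the perspective transformation $u = -f x / z$ and $v = -f y / z$
to remove x and y, we obtain

$$z(u_1, v_1) = z(u_0, v_0) \times \frac{2f + u_0 \left( \left.\frac{\partial z}{\partial x}\right|_{u_0, v_0} + \left.\frac{\partial z}{\partial x}\right|_{u_1, v_1} \right) + v_0 \left( \left.\frac{\partial z}{\partial y}\right|_{u_0, v_0} + \left.\frac{\partial z}{\partial y}\right|_{u_1, v_1} \right)}{2f + u_1 \left( \left.\frac{\partial z}{\partial x}\right|_{u_0, v_0} + \left.\frac{\partial z}{\partial x}\right|_{u_1, v_1} \right) + v_1 \left( \left.\frac{\partial z}{\partial y}\right|_{u_0, v_0} + \left.\frac{\partial z}{\partial y}\right|_{u_1, v_1} \right)} .$$

As $\partial z / \partial x = -l / \sqrt{1 - l^2 - m^2}$ and
$\partial z / \partial y = -m / \sqrt{1 - l^2 - m^2}$, we have the means of
reconstructing the surface in scene coordinates from the values
of surface orientation in image coordinates.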
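The depth-propagation step of this section can be sketched in code. The sketch assumes the perspective relations u = -f x / z and v = -f y / z and the gradients dz/dx = -l / sqrt(1 - l^2 - m^2), dz/dy = -m / sqrt(1 - l^2 - m^2) for forward-facing surfaces; the function names and argument layout are hypothetical.

```python
import math


def depth_gradients(l, m):
    """dz/dx and dz/dy of a forward-facing patch with normal components l, m."""
    n_z = math.sqrt(1.0 - l * l - m * m)
    return -l / n_z, -m / n_z


def propagate_depth(z0, uv0, lm0, uv1, lm1, f):
    """Depth at image point uv1 from the known depth z0 at uv0, using the
    summed gradients at the two points (the trapezoidal form above)."""
    (u0, v0), (u1, v1) = uv0, uv1
    zx0, zy0 = depth_gradients(*lm0)
    zx1, zy1 = depth_gradients(*lm1)
    sx, sy = zx0 + zx1, zy0 + zy1
    return z0 * (2.0 * f + u0 * sx + v0 * sy) / (2.0 * f + u1 * sx + v1 * sy)


# Slanted plane z = 10 + 0.5 x, for which the step is exact.
p, f = 0.5, 1.0
l_plane = -p / math.sqrt(1.0 + p * p)
z1_true = 10.0 + p * 1.0                 # depth at x = 1, y = 0
u1 = -f * 1.0 / z1_true                  # image position of that point
z1 = propagate_depth(10.0, (0.0, 0.0), (l_plane, 0.0),
                     (u1, 0.0), (l_plane, 0.0), f)
```

For a planar patch the step reproduces the true depth; on curved surfaces it is an approximation whose error accumulates as depth is propagated from point to point.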


7  CONCLUSION 

In this formulation of the shape-from-shading task, we have
eliminated the need to know the explicit form of the scene
radiance function by introducing higher-order derivatives into
our model. This model is applicable to natural scenery without
any additional assumptions about illumination conditions or
the albedo of the surface material. However, without a
computational scheme to reconstruct surface shape from image
irradiance we may wonder if we have surrendered too much. The
difficulties of finding a computational scheme must induce one
to ask whether the model is underconstrained. Have we applied
too few restrictions, thereby making shape recovery impossible?
Notwithstanding the general concern about underconstraint of
the model, the numerical difficulties encountered make local
computation of scene parameters attractive. Information
propagation methods must always cope with the problem of accumulated
errors. In our model, however, to achieve local computation we
must make assumptions with regard to surface shape. What
other information, besides shading, do we need to know if we are
to recover surface shape? Can we find moderate restrictions that
allow mostly local computation of the surface shape parameters?
We are actively engaged in the pursuit of such procedures.
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Abstract 


In this paper we show how assumptions and
information concerning the external world properties of
"horizontal" and "vertical" can aid in the analysis of
images, even at the very lowest levels of processing. First
we review the pervasiveness of the force of gravity, and
its influence on most natural image understanding
systems. Next, we derive several fundamental
mathematical results relating phenomena in both the
gradient space and the image space to the external world
attributes of horizontal and vertical. We then show how
these results interrelate three imaging phenomena: the
surfaces in the image, the external sensor parameters, and
the environmental labels. We detail how, in general,
specific information regarding any two of these
phenomena can be used to quantitatively derive the third;
occasionally one can do even better. Algorithms for such
quantitative derivations are presented, including two
based on the Hough transform. We further show how
certain environmental perpendicularities can be exploited
very efficiently, and even elegantly; ordinarily complex
math simplifies to the extent that environmental distances
can be directly read off the image. The power of such
environmental labels is then demonstrated by an analysis
of the source of ambiguity in a simple illusion-like image
configuration. The paper concludes with an analysis of
the class of heuristics that have been invoked throughout.
They are seen to be instantiations of the shape-from-texture
meta-heuristics that "near implies preferred" and
"preferred implies simple".
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Introduction 


Many image environments are immersed in a force
that strongly orients objects in a preferred way. The
effects of this force are often so pervasive that
environments which do not respond to it appear (and are
often called) artificial. The very term "natural scene",
vague though it may be, does at least seem to imply an
image with just such a definite environmental orientation.
Researchers would no sooner attempt to fully analyze
such an image upside-down than they would if its colors
had been permuted.


It is not difficult to be convinced of the influence
that the presence of gravity has on the design of image
understanding algorithms, especially in higher level
processing. Often it is so strong that it deeply permeates
the entire system as an implicit assumption. The
assumption is made with good reason: higher level
processing can be more efficient. Matching to models, for
example, can start with both the detected object and the
modeled object mutually aligned in the preferred (that is,
the most probable) orientation.


This research was sponsored in part by the Defense
Advanced Research Projects Agency under contract
\<S()().i<)-N-j-(  -0I-J7. 


However, we show in this paper that assuming the
presence of gravity can aid the lower levels of image
processing as well. This is a bit surprising, since many
low-level routines do work, and ought to work, just as
well with images inverted (or colors scrambled).
Nevertheless, certain heuristics regarding the exploitation
of "horizontal", "vertical", and other gravity-influenced
concepts can make low-level shape recovery more efficient
as well.


These heuristic assumptions, coupled with some
fundamental mathematical results, can suggest methods
and algorithms on the same level as other "shape from"
methods, such as shape from shading or skewed symmetry
[Horn 77; Woodham 78; Kender 80a; Ikeuchi 80]. The
heuristics themselves usually are based on the assumption
of some preference: here, the preference for horizontal or
vertical surfaces or lines. Thus, they can be seen as
further members of the family of preference-based
algorithms linked together by their derivation and use in
a common methodological paradigm, called shape from
texture [Kender 80b].


2  The  Pervasiveness  of  Gravity 

The presence of gravity introduces and maintains in
natural environments a decided anisotropy. Its lines of
force are parallel to each other in one specific, unchanging
orientation. This orientation induces, usually by means of
general energy minimization arguments, configurations
that are themselves parallel: natural (as well as artificial)
growth is often aligned with the field. Thus trees as well
as buildings often have parallel sides, and are parallel to
each other. Further, the "ground plane" is often actually
planar, also in a minimizational reaction to the force,
whether it is truly the ground, or artificially made so, as
in a floor. The combination of these growth parallelisms
and ground planes further induces perpendicularities, again
both natural and artificial. The junctions of trees or
animal legs to forest floor (or to their shadows), or the
junctions of walls to ceilings (or object legs to floors), all
occur in a limited class of orientations.


Other examples of gravity's explicit and implicit
involvement with image understanding can easily be
given. In some domains, of course, it has no influence at
all; for example, blood cell analysis. However, in most
"natural" domains (or, equivalently, most "robotic"
domains) its pervasiveness appears to be so extensive that
a natural scene might very well be defined as one in
which considerations of gravitationally induced
orientations are non-negligible. In other words, a scene is
a natural scene to the degree that it would be difficult to
understand rotated or upside-down. (Thus, images of
office interiors are about as natural as handwriting
samples; both are more natural than most aerial
photography; high magnification scanning electron
micrographs are least natural of all.)

3  Gradient Space Relations

Perhaps the first basic relationship that deals with
the environmental labels "horizontal" and "vertical" lies in
the terms used to define the degrees of freedom of the
sensor itself. The sensor orientation terms "pan", "tilt",
and "roll" imply a gravity-dependent coordinate system,
and in fact are defined in environmental terms. Pan is
sensor rotation in the horizontal plane; tilt is rotation in
the vertical plane passing through the central visual ray.
Roll is defined as rotation in the image plane, and its
effect is therefore dependent on tilt and pan: in the
absence of roll, the image of an environmentally vertical
plane that passes through the central visual ray is a
retinally vertical line.

A second basic relationship is that, in terms of its
use in computer vision, "horizontal" is simply a label for
a unique, preferred surface orientation. In terms of the
gradient space [Shafer 83a], it is a single labeled
orientation point with coordinates $(p, q) = (p_h, q_h)$.
Assuming that there is no roll in the sensor (that is, the
y-axis of the image is the projection of an environmentally
vertical plane), this point simplifies to $(p, q) = (0, q_h)$.

This relationship is schematically depicted in Figure
1. The value of $q_h$ is easily determinable: assuming the
sensor is at a unit's distance from the ground plane and
has no roll, then the central visual ray intersects the
ground at $q_h$. Note that this value can also be obtained
by a simple gravity sensor. The y-axis now lies in an
environmentally vertical plane as in Figure 2; further, due
to the rotational coupling of the gradient space to the
image space [Kender 80b], the horizontal orientation has
no $p$ component.


Figure  1:  Basic  relations:  sensor  configuration. 


It is not hard to show that every vertical surface
must map into a gradient space point with coordinates
$(p, q) = (p, -1/q_h)$. This fact follows from the general rule
that the gradients of surfaces perpendicular to a given
gradient $(p_h, q_h)$ must satisfy the relation
$p p_h + q q_h = -1$;
every vertical surface is perpendicular to the horizontal.
Thus, the one-dimensional family of vertical surfaces
maps into the one-dimensional locus $q = -1/q_h$, as in
Figure 3 [Mackworth 73]. As a special case, if there is
neither sensor roll nor tilt, then the gradient space
representation for the horizontal surface is infinitely far
along the positive $q$ axis, and the line of verticals becomes
the $p$ axis.
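
The perpendicularity rule above can be checked numerically. The following is a minimal sketch, not code from the paper: the function names and the sample value of $q_h$ are illustrative assumptions, and the sensor is assumed to have no roll.

```python
def is_perpendicular(g1, g2, tol=1e-9):
    """True when two gradient-space points satisfy p1*p2 + q1*q2 = -1."""
    p1, q1 = g1
    p2, q2 = g2
    return abs(p1 * p2 + q1 * q2 + 1.0) < tol


def horizontal_gradient(q_h):
    """Gradient of the horizontal surface for a sensor with no roll."""
    return (0.0, q_h)


def vertical_gradient(p, q_h):
    """A point on the line of verticals q = -1/q_h; any p is allowed."""
    return (p, -1.0 / q_h)


if __name__ == "__main__":
    q_h = 2.0  # illustrative tilt value, not taken from the paper
    for p in (-3.0, 0.0, 0.5, 10.0):
        ok = is_perpendicular(vertical_gradient(p, q_h), horizontal_gradient(q_h))
        print(p, ok)
```

Every point on the line $q = -1/q_h$ passes the check regardless of $p$, which is the one-dimensional freedom of the family of vertical surfaces.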

Figure 2: Basic relations: Corresponding image space.

Figure 3: Basic relations: Corresponding gradients.

More generally, similar basic relationships hold even
if there is a roll component. It is not hard to show that if
there is information available about the sensor's tilt and
roll, then the gradient space can be environmentally
labeled by invoking the rotational coupling of the image
space to the gradient space. That is, if tilt is given as
above by the angle whose tangent is $q_h$, and the roll
component is given as $t$ with respect to the unrolled
sensor position, then the point in the gradient space
corresponding to the horizontal is given by $(0, q_h)$ similarly
rotated through $t$. The line of verticals rotates likewise.
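
As an illustration of the rolled case, the sketch below rotates both the horizontal point and points on the line of verticals through the roll angle and confirms that they remain mutually perpendicular. The counterclockwise sense of rotation and all function names are assumptions made for this example; the text does not fix a sign convention for roll.

```python
import math


def rotate(point, t):
    """Rotate a gradient-space point counterclockwise by t radians (assumed sense)."""
    p, q = point
    return (p * math.cos(t) - q * math.sin(t),
            p * math.sin(t) + q * math.cos(t))


def horizontal_with_roll(q_h, t):
    """The unrolled horizontal point (0, q_h), rotated through the roll t."""
    return rotate((0.0, q_h), t)


def vertical_with_roll(p, q_h, t):
    """A point of the unrolled line of verticals q = -1/q_h, rotated through t."""
    return rotate((p, -1.0 / q_h), t)


if __name__ == "__main__":
    q_h, t = 2.0, 0.3  # illustrative values only
    p_h, q_hr = horizontal_with_roll(q_h, t)
    for p in (-1.0, 0.0, 4.0):
        p_v, q_v = vertical_with_roll(p, q_h, t)
        # Rotation preserves the dot product, so this stays at -1.
        print(p, p_v * p_h + q_v * q_hr)
```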

It is important to note that the above relations hold
independently of any considerations of imaging projection.
They are true for both orthography and perspective; they
are true, in fact, even with no image at all. They describe
the relations of the gradient space to environmental
preference labels only. (Alternatively, one can use the
analogous relations that prevail when surface orientations
are recorded on a Gaussian sphere map [Horn 82].)
Further, they can be used in either direction: given sensor
information, the gradient space can be labeled, and vice
versa.


4  Image  Space  Relations 

At this point we have not yet used any image
information. In fact, inasmuch as surfaces exist in
three-space, they cannot appear directly in an image at all.
However, it is interesting to note that the same
environmental labels of horizontal and vertical apply to
lines as well, and to both lines in three-space and lines on
the retina. Somewhat paradoxically, though, the size of
the class of environmentally horizontal lines is one
dimension greater than that of environmentally vertical
ones; this is the reverse of the case with surfaces.

Environmental labels, environmental line segments,
and the sensor parameters are related in several ways. To
demonstrate them, consider first the case of perspective
imaging where the sensor has no roll component. Scale
the image plane in units of focal length; this will simplify
the mathematics [Kender 80b]. Now image a scene
consisting of vertical lines emerging from a horizontal
plane, rather like a vast, stylized forest. The result is
shown schematically in Figure 4.


Figure 4: Vertical lines on a horizontal surface.

Because the class of environmentally horizontal lines
is so large, they retain no distinguishing retinal features.
That is, any line in the image can be the image of an
environmentally horizontal line. About the only
exploitable horizontal property is the horizon itself. This
line, the limit of the projection of the horizontal plane, is
retinally horizontal. It has the equation $y = 1/q_h$.
This follows from the basic relationship concerning
vanishing lines: the plane with gradient $(p, q)$ has
vanishing line $px + qy = 1$; here, $(p, q) = (0, q_h)$.

More interesting is the behavior of the
environmental verticals. They form a more restricted
class, and their images are more constrained. In
particular, any environmentally vertical line must image
into a retinal line that passes through the point
$(x, y) = (0, -q_h)$. This follows as a special case of the
analysis of vanishing points [Kender 79]. As the sensor's tilt
increases so that its central visual ray approaches the
vertical (i.e. as $q_h$ approaches 0), this vanishing point of
verticals approaches the image origin; simultaneously the
horizon moves off in the positive $y$ direction.
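
The horizon and the vanishing point of verticals can be reproduced by projecting distant points. This is a sketch under stated assumptions rather than the paper's derivation: depth is taken as $Z = pX + qY + c$ along the optical axis (the convention that makes $px + qy = 1$ the vanishing line), the focal length is 1, there is no roll, and the function names are hypothetical.

```python
def project(point3d):
    """Perspective projection with focal length 1: (X, Y, Z) -> (X/Z, Y/Z)."""
    x, y, z = point3d
    return (x / z, y / z)


def point_on_vertical_line(base, q_h, s):
    """Point at parameter s on the environmentally vertical line through base.

    The vertical direction is the normal of the horizontal plane,
    (0, q_h, -1) under the assumed depth convention.
    """
    x, y, z = base
    return (x, y + s * q_h, z - s)


def point_on_horizontal_line(base, direction_xy, q_h, s):
    """Point at parameter s on a line lying in a plane Z = q_h * Y + c."""
    x, y, z = base
    dx, dy = direction_xy
    return (x + s * dx, y + s * dy, z + s * q_h * dy)


if __name__ == "__main__":
    q_h = 2.0                 # illustrative tilt value
    base = (3.0, 1.0, 5.0)    # arbitrary scene point in front of the sensor
    far = 1e9                 # large parameter standing in for "at infinity"
    print(project(point_on_vertical_line(base, q_h, -far)))               # near (0, -q_h)
    print(project(point_on_horizontal_line(base, (1.0, 1.0), q_h, far)))  # y near 1/q_h
```

The image of the vertical line heads toward $(0, -q_h)$ whatever base point is chosen, while a receding horizontal line reaches the height $y = 1/q_h$ at a position along the horizon that depends on its direction.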

If  the  forest  scene  is  imaged  by  an  orthographic 
sensor,  very  little  environmental  information  remains  in 
the  image.  Nothing  at  all  remains  of  the  horizon.  All 
environmentally  vertical  lines  are  imaged  as  retinally 
vertical lines; they have no finite vanishing point.
Therefore,  under  orthography  there  are  no  image  cues  to 
sensor  tilt. 

As with the gradient space relations, these image
relations hold analogously under sensor roll. If the sensor
is orthographic, then the parallel family of image verticals
rolls proportionately. If the sensor uses perspective, then
the horizon and the vanishing point of verticals also roll
proportionately, and their expected locations are easy to
compute, given tilt. The close relation between tilt and
roll and the generated horizon is well known: it is
exploited in the artificial horizon instruments of airplane
cockpits.

Note that unlike the gradient space relations,
however, these relations are not automatically reversible.
That is, a given sensor configuration predicts definite
image phenomena, but a given image phenomenon does
not necessarily imply a sensor configuration. However, if
the phenomenon can be environmentally labeled
accurately (i.e. "horizon", "vertical vanishing point"),
then the implications about the sensor are correct. In
general, though, this labeling must be done heuristically,
as described below.

5  Using  the  Gradient  Space  Relations 

The  relationships  described  above  can  be  exploited 
in  many  ways.  For  example,  given  the  sensor 
configuration,  one  can  recover  an  environmentally  labeled 
gradient  space  map  as  in  Figure  3.  If  the  sensor 
configuration is uncertain, the gradient space map (more
simply,  the  gradient  of  a  properly  labeled  horizontal 
surface)  can  be  used  to  help  calibrate  tilt  and  roll. 

(It should be noted that the pan parameter cannot be
recovered. In a sense, pan is "gravity invariant". That
is, there is no information in an environmentally labeled
gradient space map that would indicate pan. Pan does
not even have any common environmental names.
Perhaps the closest terms would be those used to describe
compass directions: "north-by-northwest", etc. However,
the magnetic force on which they are based seems to have
negligible environmental influence; few natural systems
appear capable of detecting it. The natural world does
not seem to have a strong left-right preference. For
example, although it is nearly impossible to find a
newspaper photograph that has been printed upside-down,
it is not unusual to find one that has been
"flopped" left-for-right. Nevertheless, there may be
artificial environments in which it would be useful to
augment a mobile robot's gravity sensor with a compass.)

Additional uses of these relations include the
following. If the sensor parameters are known, then the
determination that a given surface is horizontal uniquely
specifies its gradient. The determination that it is vertical
creates a linear constraint in the gradient space on which
its gradient must lie. This constraint can be used with
any other gradient space constraints: for example, those
obtained by shape from shading, skewed symmetry, or
shape from texture.

If the sensor parameters are unknown, then a
determination that two non-parallel surfaces are vertical
yields tilt and roll: their gradients generate the line of
verticals in the gradient space. The determination of a
single surface being vertical constrains tilt and roll to one
degree of freedom; horizontal surfaces must lie
perpendicular to it.
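
To make the two-verticals case concrete, the sketch below solves the two perpendicularity equations $p_i p_h + q_i q_h = -1$ for the horizontal point and then reads off a tilt magnitude and a roll angle. It is an illustrative reconstruction, not the paper's procedure: the function names are hypothetical, and the roll is measured counterclockwise from the positive $q$ axis by assumption.

```python
import math


def horizontal_from_two_verticals(g1, g2):
    """Solve p1*p_h + q1*q_h = -1 and p2*p_h + q2*q_h = -1 for (p_h, q_h)."""
    (p1, q1), (p2, q2) = g1, g2
    det = p1 * q2 - p2 * q1
    if abs(det) < 1e-12:
        # The line through the two gradients passes through the origin
        # (or the gradients coincide), so no finite horizontal point exists.
        raise ValueError("gradients do not determine a finite horizontal point")
    return ((q1 - q2) / det, (p2 - p1) / det)


def tilt_and_roll(horizontal):
    """Distance of the horizontal point from the origin, and its roll angle.

    The distance plays the role of q_h for the unrolled sensor; the roll is
    the counterclockwise angle from the positive q axis (assumed convention).
    """
    p_h, q_h = horizontal
    return math.hypot(p_h, q_h), math.atan2(-p_h, q_h)


if __name__ == "__main__":
    # Two vertical surfaces seen by an unrolled sensor with q_h = 2.
    h = horizontal_from_two_verticals((1.0, -0.5), (3.0, -0.5))
    print(h, tilt_and_roll(h))
```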

5.1  A  Hough-like  Algorithm 

Suppose we pose the more difficult problem in
which there is neither a labeling nor sensor information.
Nevertheless, both can still be (heuristically) recovered.
Consider the additional assumption that all (or most)
surfaces are either horizontal or vertical, an assumption
often supportable in man-made environments. Then the
gradient space representation (or the Gaussian map) of
the surfaces in the scene can be analyzed for the presence
of the characteristic point-of-horizontal/line-of-verticals
configuration. This need not be an actual search for the
line of verticals, although there may be some
environments in which this is an efficient thing to do.
Instead it can be achieved using a type of Hough
accumulator approach.

In broadest outline, this method has all existing
surfaces vote for candidate horizontal surfaces. Once
voting is done, the surface with the most votes is then
presumed to be horizontal. Sensor tilt and roll, and the
line of verticals, are then easily determined.

Voting is prescribed in the following way. Since a
given surface is likely to be either horizontal or vertical, it
votes once for itself, since it may itself be horizontal.
However, since it may also be vertical, it votes once for all
surfaces perpendicular to it: in this case, at least one of
these surfaces must be the horizontal. Graphically, this is
displayed in Figure 5. A vote for the self is shown by a
circle about the point; a vote for all perpendiculars is
indicated by a dashed line. In the example, only one
surface has received four votes; it is assumed to be
horizontal.
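
The voting scheme just described can be sketched as a small accumulator over a quantized gradient space. This is an illustrative implementation under assumptions that the text does not specify: a square grid of cell size `step`, a bounded sampling range `extent` along each line of perpendiculars, and one vote per cell per surface. All names are hypothetical.

```python
import math
from collections import Counter


def cell(p, q, step):
    """Quantize a gradient-space point to an accumulator cell index."""
    return (round(p / step), round(q / step))


def perpendicular_line_cells(g, step, extent):
    """Cells crossed by the line p*p0 + q*q0 = -1 of gradients perpendicular to g."""
    p0, q0 = g
    norm2 = p0 * p0 + q0 * q0
    if norm2 == 0.0:
        return set()  # the frontal surface (0, 0) has its perpendiculars at infinity
    norm = math.sqrt(norm2)
    foot = (-p0 / norm2, -q0 / norm2)      # point of the line closest to the origin
    direction = (-q0 / norm, p0 / norm)    # unit vector along the line
    n = int(2 * extent / step)             # samples spaced half a cell apart
    cells = set()
    for k in range(-n, n + 1):
        s = k * step / 2.0
        cells.add(cell(foot[0] + s * direction[0], foot[1] + s * direction[1], step))
    return cells


def vote_for_horizontal(gradients, step=0.25, extent=8.0):
    """Each surface votes for itself and for every cell on its line of perpendiculars."""
    votes = Counter()
    for g in gradients:
        votes[cell(g[0], g[1], step)] += 1
        for c in perpendicular_line_cells(g, step, extent):
            votes[c] += 1
    (i, j), count = votes.most_common(1)[0]
    return (i * step, j * step), count


if __name__ == "__main__":
    # One horizontal surface and three verticals for an unrolled sensor with q_h = 2.
    surfaces = [(0.0, 2.0), (1.0, -0.5), (-2.0, -0.5), (0.5, -0.5)]
    print(vote_for_horizontal(surfaces))
```

With these four surfaces the cell containing $(0, 2)$ collects one self-vote plus one vote from each vertical, for four in total. The grid and sampling choices are the quantization decisions the critique below warns about, so a coarser `step` or a smaller `extent` can change the outcome.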


Figure  5:  Hough  scheme  for  finding  ground  planes. 

5.2  A  Critique  of  the  Algorithm 


This method is not without its problems, but it does
have a virtue or two. The problems are manifest. Most
critically, any such weighting scheme is heavily dependent
on the given gradient space map, which in turn is affected
by the environment and by the sensor position. Thus if
there is only one surface present, or even if there are two
mutually perpendicular ones, there are no grounds by
which to label anything horizontal. If there are two
non-perpendicular surfaces present, the method considers them
both vertical to a common horizontal (which does not
appear in the gradient space). If there are multiple
surfaces, the voting is affected by the way in which the
multiple surfaces have been recorded in the gradient space
map: perhaps this map itself has been weighted. Lastly,
the method is subject to the time and space problems that
all Hough methods are plagued with: the space must be
carefully quantized (a problem which is less severe on the
Gaussian sphere), the line of votes must be calculated,
votes must be distributed among accumulators
proportionately, etc.

But the method does have some justifications. In
particular, like most Hough transforms it can be made
heuristically more efficient, and it is likely to be robust
with respect to noise, which in this case consists of surfaces
which are neither horizontal nor vertical. Further, it works
with surfaces that are curved verticals: building support
columns, say, or drapery. In these cases, the gradient
space map of the vertical surfaces is diffused along a line.
Nevertheless the voting proceeds accurately, with each
small quantum of the diffusion adding its small votes for
its own perpendiculars. Perhaps most interesting is the
result that the horizontal can be found even if there is no
direct evidence for it in the gradient space: the ground
can be "seen" even though it is "not there", as in Figure
6. This occurs when many environmentally vertical
surfaces all vote for their perpendiculars; the one
perpendicular they have in common must be horizontal,
whether it is present in the gradient space map or not.


Figure  6:  "Seeing"  the  ground  plane. 

6  Using  the  Image  Space  Relations 

The relations concerning image configurations can
also be exploited in many ways. The simplest case is
when all sensor information is known. One immediate
result is that locations of both the horizon and the
vanishing point of verticals are then also known, whether
or not any phenomena suggesting them actually appear in
the image. (If either are suggested by an image
configuration, then that configuration can be assumed to

image of the vertical plane passing through the focal
point. (With no roll this image is the y-axis.) One figure
suffices for both the orthographic and special perspective
cases because under orthography any image can be
translated to the origin without affecting the gradient
space.


the horizontal plane at 45°; this angle is commonly used
in architectural drawing [Morgan 50].

The second property is that under pure orthography
all right angles with a vertical side behave identically, in
one respect. Distances on the horizontal plane on which
they stand can be read off from the image, independently
of the angle $\alpha$ that their images form. Consider Figure 9.
Let the vertex be at relative depth $z = 0$; distance
increases towards the observer. Draw the retinally
horizontal line $y = \cot\alpha$; the segment intercepted by the
angle is of length 1. The total depth at the left intercept
is $z = \cot\alpha / q_h$, since the vertical line has no $p$ component
and it increases in depth proportionally to the sensor
tilt. The total depth at the right intercept is the depth at
the left plus the pure $p$ component depth increase due to
a movement of 1 image unit to the right. Thus, the depth
at the right is
$\cot\alpha / q_h - \cot\alpha \, (q_h + 1/q_h) = -q_h \cot\alpha$,
using the function relating $p$ to $q_h$.


Figure 7: Simplest Kanade hyperbola: image.

The resultant constraint equation is still a
hyperbola, but it is extremely simple: it is
$p = -\cot\alpha \, (q + 1/q)$, with $p$ now a one-to-one function of $q$. As
shown in Figure 8, this constraint is uniquely intercepted
by the line of vertical surfaces $q = -1/q_h$ for any value
of $q_h$. Thus, under orthographic conditions the gradient
of the generated vertical surface is uniquely defined.
(Note that if the vertical line is an object edge and the
horizontal line is the edge's shadow, then this gradient
constrains the direction of the illuminant; see [Shafer
83b].)


Figure  8:  Simplest  Kanade  hyperbola:  gradient  space. 

This special case hyperbola has several interesting
properties. The first is that the minimum value of $p$
always occurs at $q = -1$, independent of $\alpha$. Since in
orthographic photographs there is no indication of the
sensor tilt, $q_h$, the observer is free to select a tilt at will.
The choice of $q = -1$ guarantees that left-right slant is
minimized (i.e. the surface "regresses to the frontal
plane"). This value of $q$ is equivalent to looking down at

Figure 9: Right angle depths: image calculations.
(Figure labels: $z = \cot\alpha / q_h$ at the left intercept;
$z = -q_h \cot\alpha$ at the right intercept.)

This can all be summarized by stating that on any
line $y = c$ the depth on the vertical leg is $c/q_h$ and the
depth on the horizontal leg is $-c q_h$, independent of $\alpha$. In
particular, any rectangular prism of whatever size, resting
on a horizontal surface with known tilt, can be easily
labeled for relative depth in the image itself, starting at
any vertex and propagating depth changes outwards; see
Figure 10.
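
The independence from $\alpha$ can be verified directly. In the sketch below, depth increases towards the observer as in the text; the function names are illustrative assumptions, and `depth_after` reflects one reading of the label in Figure 10 (a step of $c$ along a horizontal leg followed by a step of $d$ along a vertical leg).

```python
import math


def leg_depths(c, q_h):
    """Depths on the line y = c: (vertical leg, horizontal leg)."""
    return (c / q_h, -c * q_h)


def right_intercept_depth(alpha, q_h):
    """Depth at the right intercept of y = cot(alpha), derived as in the text.

    Start from the depth on the vertical leg, then add the p-component
    depth change for a move of one image unit to the right.
    """
    cot = 1.0 / math.tan(alpha)
    z_left = cot / q_h
    p = -cot * (q_h + 1.0 / q_h)
    return z_left + p * 1.0


def depth_after(c, d, q_h):
    """Relative depth after image heights c (horizontal leg) and d (vertical leg)."""
    return -c * q_h + d / q_h


if __name__ == "__main__":
    q_h = 2.0  # illustrative tilt value
    for alpha in (0.3, 0.8, 1.2):
        c = 1.0 / math.tan(alpha)
        print(alpha, right_intercept_depth(alpha, q_h), leg_depths(c, q_h)[1])
```

Both columns agree for every $\alpha$: the derived depth depends only on the image height $c = \cot\alpha$ and on $q_h$.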

Figure 10: Right angle depths: propagation.
(Figure label: $z = -c q_h + d/q_h$.)


(An alternate derivation is shown in the side view of
Figure 11. Along the plane of constant depth, at $c$ units
above the vertex, the environmental vertical has depth
change $c/q_h$ by similar triangles. Similarly, the
horizontal's is $c q_h$. This side view also indicates the
independence of depth calculations with respect to the
image of the right angle; all that matters is the relative
height in the image plane, and the environmental labels of
vertical line or horizontal plane.)




Figure  11:  Right  angle  depths:  side  view. 

8  Ambiguity  in  Labeling 

In the previous section, we gave algorithms for
exploiting the perpendicularity that arises between
horizontal and vertical lines. We demonstrate here that
that configuration's power comes not from the
perpendicularity per se, nor even from the fact that the
surface that is formed is vertical, but from their
individual environmental labels. We show this by
demonstrating that two general perpendicular lines, even
within an environmentally vertical plane, give rise to
ambiguous surface orientations. In this discussion, we
make the simplest of assumptions: orthographic imaging
with known sensor parameters and no roll; basically, this
is a counter-example.

Consider Figure 12; it is the image of an
environmentally vertical plane in which there is embedded
a right angle. Neither side of the angle is environmentally
horizontal or vertical, however. The constraint in the
gradient space that the image generates, from the
assumption that it is environmentally right, is again the
Kanade hyperbola; see Figure 13. However, because of
the orientation of the image angle, this hyperbola is no
longer a function. Further, some values of $q$ have no
corresponding $p$; certain lines of vertical surfaces would
not intersect this constraint curve. This is another way of
saying that some values of sensor tilt $q_h$ are incompatible
with the interpretation of the image angle as a right angle
in a vertical plane. (This was not so in the case with
environmentally labeled sides.) Worse, nearly every line
of vertical surfaces that does intersect it intersects it
twice. That is, for nearly every sensor tilt for which the
image has an interpretation as a vertical plane, it has two
possible gradients.
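
The zero, one, or two interpretations can be counted with a short calculation. The sketch below is an assumption-laden illustration, not the paper's own formulation: it uses the orthographic perpendicularity constraint $(\mathbf{a} \cdot \mathbf{b}) + (G \cdot \mathbf{a})(G \cdot \mathbf{b}) = 0$ for image directions $\mathbf{a}$, $\mathbf{b}$ and gradient $G = (p, q)$, measures the leg angles from the image x-axis, and writes the line of verticals as $q = q_v$ (that is, $q_v = -1/q_h$). The function name is hypothetical.

```python
import math


def vertical_plane_gradients(alpha, beta, q_v, eps=1e-12):
    """All p for which gradient (p, q_v) makes the two image legs perpendicular in space.

    alpha, beta: image angles of the two legs, measured from the image x-axis.
    q_v: the q coordinate of the line of verticals (q_v = -1/q_h).
    """
    a = math.cos(alpha) * math.cos(beta)
    b = q_v * math.sin(alpha + beta)
    c = q_v * q_v * math.sin(alpha) * math.sin(beta) + math.cos(alpha - beta)
    if abs(a) < eps:
        # One leg is retinally vertical: the constraint is linear in p.
        return [] if abs(b) < eps else [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return []
    if disc == 0.0:
        return [-b / (2.0 * a)]
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])


if __name__ == "__main__":
    deg = math.pi / 180.0
    # Labeled case: one leg retinally vertical, so exactly one gradient.
    print(vertical_plane_gradients(90 * deg, 60 * deg, -0.5))
    # Unlabeled legs: no interpretation for one tilt, two for another.
    print(vertical_plane_gradients(45 * deg, 60 * deg, -1.0))
    print(vertical_plane_gradients(45 * deg, 60 * deg, -10.0))
```

In the labeled case the result agrees with $p = -\cot\alpha \, (q + 1/q)$ for a 30° image angle between the legs; with unlabeled legs the quadratic either has no real root or two, which is the ambiguity described above.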

It turns out that the two interpretations are
somewhat difficult to visualize. Perhaps the best way to
view them is with a physical construction, rather than by
studying Figure 14. The case where the line of verticals is
tangent to the hyperbola probably corresponds to the
configuration where both legs are at 45° to the
environmental horizontal.


Figure  12:  Ambiguous  vertical  planes:  image. 



Figure  13:  Ambiguous  vertical  planes:  gradient  space. 


Figure  14:  Ambiguous  vertical  planes:  interpretations. 


9  Discussion:  Meta-heuristics 

Although some of the relationships discussed in this
paper have been absolute, many of them depended on
heuristic assumptions. Most of the assumptions were of a
similar form. The basic reasoning was as follows.

Certain preferred environmental objects create
specific image configurations; for example, the images of
environmentally vertical lines converge to a vanishing
point. However, other environmental objects could also
create the same configuration; for example,
environmentally horizontal or oblique lines could also
converge on the same vanishing point. The heuristics
throughout assumed that the image configurations could
be uniquely inverted as to cause; here, convergence
implies environmentally vertical. More simply, the
presence of an image feature similar to a preferred
object's image features was taken as evidence for the
preferred object. This "near implies preferred"
meta-heuristic has proven useful in several other contexts,
specifically shape from texture and skewed symmetry.

What sorts of environmental objects are preferred?
One basis for preference is the simplicity with which
image signatures can be inverted. For example, in the
gradient space, both horizontal and vertical surfaces are
easy to manipulate because their classes are small and
well-defined; oblique surfaces are not. Horizontal and
vertical surfaces are therefore preferred. This
meta-heuristic that "preferred implies simple" has also proven
useful in other contexts.

But perhaps the strongest evidence of the utility of the
meta-heuristics is in the suggestion that foveation serves
purposes other than an increase in resolution. Viewing
perpendicularities off-axis under perspective leads to
difficult mathematics; foveating them makes the math
very simple. Thus, foveation helps to create simple
signatures and helps define a preferred object. The
implication for image understanding might be that all
near-perpendicularities should be foveated; they might be
the images of easily determinable local vertical
surfaces.

This paper adapts Horn and Schunck's work on optical flow [3] to
the problem of determining arbitrary motions of objects from
2-dimensional image sequences. The method allows for gradual changes
in the way an object appears in the image sequence, and allows for flow
discontinuities at object boundaries. We find velocity fields that give
estimates of the velocities of objects in the image plane. These velocities
are computed from a series of images using information about the spatial
and temporal brightness gradients. A constraint on the smoothness of
motion within an object's boundaries is used. The method can be applied
to interpretation of both reflectance and x-ray images. Results are
shown for models of ellipsoids undergoing expansion, as well as for an
x-ray image sequence of a beating heart.

Introduction

Interpreting the motion of objects from a sequence of images is
difficult because image changes may be due to a number of factors.
First, image changes may be due to object translations or rotations, or
to relative motion of one object such that it occludes another. Second,
changes may occur when non-rigid objects change shape or size.
Third, parts of an image need not change even though they
correspond to a moving object; for example, regions of an image
corresponding to flat surfaces of constant reflectance may exhibit no
change if the object undergoes only translation. Fourth, changes may
result from motion of the observer. Thus, effective algorithms that
measure object motion from sequences of images should do two
things:


Horn and Schunck [3] addressed a problem of computing optical
flow from an image sequence. They define optical flow as "the
distribution of apparent velocities of movement of brightness
patterns" in a sequence of images. Usually optical flow refers to the
flow of the imaged world across the retina as a biological observer
moves continuously through the world. However, if we assume a
stationary viewer and assume there are no changes in the brightness
patterns as a result of the motion, then Horn and Schunck's definition
of optical flow gives the velocities of objects projected onto the image
plane. To say that there are no changes in the brightness patterns
means that the image brightness corresponding to a single physical
point on an object is the same from one frame to the next. This
restriction permits only translation of objects parallel to the image
plane and does not allow arbitrary rotations or perspective
transformations. In order to compute optical flow, Horn and Schunck
assumed that the velocities varied smoothly over the entire image.
This assumption has limited utility in real images, where object
boundaries are usually places of discontinuous velocity for both the
case of a moving object and for an observer moving with respect to a
static scene.

Our approach also involves computing velocities at the points in
an image, but our method differs from Horn and Schunck's in two
important ways. First, the velocity smoothness constraints are
applied only within regions that are separated from the rest of the
image by recognizable boundaries. Velocities are free to change
abruptly across these boundaries. Second, changes in the brightness
patterns are allowed so that velocities more closely represent the
arbitrary motions of objects projected onto the image plane. For
example, gradual shading changes that occur with rotation relative to
the light source may be accommodated.


• They should distinguish between image changes due to
motion of objects, due to deformation of objects, and due
to occlusion.

• They should determine whether regions of an image that
exhibit no apparent brightness changes correspond to
moving surfaces.

This paper develops methods for assigning velocities to image
points by examining changes in brightness at each point in a sequence
of images. While many of the techniques may be applicable to
environments where the observer is moving, the emphasis will be on
interpreting image sequences where the observer is stationary and only
objects move. In general, we must notice that motion analysis from
images cannot be solved without making assumptions about the
underlying motion of objects represented in the image sequence.


The methods developed are applied to models of ellipsoids
undergoing expansion and to x-ray image sequences of a beating
heart. In the latter case the pattern changes of interest are those that
occur when the heart changes shape in a direction perpendicular to the
image plane.


Problem  Statement 

If there is no a priori knowledge about the structure of objects in
a scene, then measurement of velocity relies on local information
about temporal and spatial gradients of image brightness. This local
information provides only one constraint, the change in brightness at a
given point, while the velocity of a point in an image has two
components. In simple situations, where moving objects only undergo
translation parallel to the image plane without changing their pattern
in the image, this constraint determines the component of velocity
parallel to the brightness gradient. When the brightness gradient is
zero in the direction of motion (e.g. flat region of an object with
constant reflectance or a stripe pattern in the direction of motion),
then there is no local velocity information. In all cases additional
constraints must be imposed to determine the two components of
velocity in the image plane as well as to determine the changes in the
image pattern.
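
The single local constraint can be illustrated numerically. The following is a minimal sketch, assuming no pattern change (dI/dt = 0) so that the constraint reads I_x u + I_y v + I_t = 0; the function name `normal_flow` and the sample gradient values are hypothetical and chosen only for illustration.

```python
def normal_flow(ix, iy, it):
    """Component of image velocity along the brightness gradient.

    Assumes no pattern change, so Ix*u + Iy*v + It = 0.
    Returns (u_n, v_n), or None where the gradient vanishes and the
    constraint carries no velocity information.
    """
    g2 = ix * ix + iy * iy
    if g2 == 0.0:
        return None
    scale = -it / g2
    return (scale * ix, scale * iy)


# A vertical edge (gradient along x) moving with true velocity (1.0, 0.5).
ix, iy = 2.0, 0.0
u_true, v_true = 1.0, 0.5
it = -(ix * u_true + iy * v_true)  # temporal change implied by the constraint

print(normal_flow(ix, iy, it))     # (1.0, 0.0): the y component is lost
print(normal_flow(0.0, 0.0, 0.0))  # None: flat region, no information
```

Only the velocity component along the gradient is recovered; the component along the edge must come from the additional constraints discussed in the text.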


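constrained the local change in velocity by minimizing the square of
the magnitude of the spatial gradient of the velocity components:

$$\epsilon_s^2 = \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \qquad (4)$$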


Let the image brightness projected by a point on a moving
object at a time t be given by I(x,y,t). At a later time t+dt the same
object point has moved so that its projected position in the image
plane is given by (x+dx, y+dy). The brightness of this point may
have changed to a value I(x+dx, y+dy, t+dt). Such a change occurs
when lighting and shading change as an object rotates or when the
object itself changes shape. The total rate of change of brightness
dI/dt is given by:

$$\frac{dI}{dt} = \frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} + \frac{\partial I}{\partial t} \qquad (1)$$
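
The step from the displaced brightness value to the total derivative can be written out. This is a sketch of the standard argument, assuming I is differentiable and the displacement (dx, dy, dt) is small enough for a first-order expansion:

```latex
I(x+dx,\, y+dy,\, t+dt) \approx I(x,y,t)
  + \frac{\partial I}{\partial x}\,dx
  + \frac{\partial I}{\partial y}\,dy
  + \frac{\partial I}{\partial t}\,dt
\quad\Longrightarrow\quad
\frac{dI}{dt} = \lim_{dt \to 0} \frac{I(x+dx,\, y+dy,\, t+dt) - I(x,y,t)}{dt}
  = \frac{\partial I}{\partial x}\frac{dx}{dt}
  + \frac{\partial I}{\partial y}\frac{dy}{dt}
  + \frac{\partial I}{\partial t}
```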


In order to solve for the optical flow u and v, Horn and
Schunck combined the two assumptions (the zero brightness change
and the smoothness constraint) by minimizing the following function:

$$\iint \left[\left(\frac{dI}{dt}\right)^2 + \alpha^2 \epsilon_s^2\right] dx\,dy \qquad (5)$$

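where the integral is over the entire image, $\epsilon_s^2$ is the
smoothness measure of equation (4), and $\alpha^2$ is a weighting factor
that depends on the noise in the gradient measurements. The
following iterative formulae provide the solution for the flow velocities
that minimizes equation (5):

$$u^{k+1} = \bar{u}^k - I_x \frac{I_x \bar{u}^k + I_y \bar{v}^k + I_t}{\alpha^2 + I_x^2 + I_y^2}, \qquad v^{k+1} = \bar{v}^k - I_y \frac{I_x \bar{u}^k + I_y \bar{v}^k + I_t}{\alpha^2 + I_x^2 + I_y^2} \qquad (6)$$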


where $\partial I/\partial x$ and $\partial I/\partial y$ are the x and y components of the spatial
brightness gradient and $\partial I/\partial t$ is the temporal brightness change
measured at the point (x,y). The three variables that are to be
determined are the x and y components of velocity, i.e. dx/dt and
dy/dt, respectively, and the brightness change dI/dt. To simplify the
notation, we introduce the abbreviations $I_x$, $I_y$ and $I_t$ for the partial
derivatives of brightness with respect to x, y and t and the
abbreviations u and v for the x and y velocity components. Equation
(1) can then be rewritten in the following way:

$$\frac{dI}{dt} = I_x u + I_y v + I_t \qquad (2)$$

To solve this equation for the velocities (u,v) and the rate of
brightness change (dI/dt), other constraints must be applied that
restrict the allowable motions. For example, the assumption can be
made that the velocity and pattern changes are constant or that they
change smoothly within a region. It could also be assumed that the
velocities and patterns vary in a constrained manner over time [4].


In the next section, we review Horn and Schunck's method for
computing optical flow, and identify problems with it. The remaining
sections propose a set of modifications and extensions to cope with
these problems. First, we present a technique that permits velocity
flow discontinuities at boundaries. Then we suggest a way to
accommodate some of the changes in brightness patterns that occur as
a result of motion. The final section presents results obtained by
applying the modifications to a model of an expanding ellipsoid and
an example that incorporates all of these techniques to analyze heart
motion from a sequence of x-ray images.


Horn  and  Schunck’s  Method  for  Computing 
Optical  Flow 

Horn and Schunck [3] assumed no pattern change in the image
so that the brightness change with time corresponding to a single
physical point, dI/dt, is equal to zero, i.e.:

$$I_x u + I_y v + I_t = 0 \qquad (3)$$
where $\bar{u}^k$ and $\bar{v}^k$ denote local averages of the velocity components
computed at the k-th iteration. A region where there is no apparent
local velocity information (e.g. flat region of constant reflectance) will
derive its velocity from the surrounding region, because during the
iterative process, velocities will tend to propagate and fill in these
regions.
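
The fill-in behaviour can be seen in a small experiment. The sketch below iterates the Horn and Schunck update on a 5 by 5 grid in which only the middle column carries gradient information. It is an illustration under stated assumptions, not the original implementation: the local average is a plain four-neighbour mean with indices clamped at the image border (Horn and Schunck use a weighted neighbourhood), and the names `local_mean`, `horn_schunck`, and the value `alpha2=1.0` are hypothetical.

```python
def local_mean(f, i, j):
    """Mean of the four nearest neighbours, with indices clamped at the border."""
    h, w = len(f), len(f[0])
    up, down = f[max(i - 1, 0)][j], f[min(i + 1, h - 1)][j]
    left, right = f[i][max(j - 1, 0)], f[i][min(j + 1, w - 1)]
    return (up + down + left + right) / 4.0


def horn_schunck(ix, iy, it, alpha2=1.0, iterations=100):
    """Iterate the zero-pattern-change update, starting from zero velocities."""
    h, w = len(ix), len(ix[0])
    u = [[0.0] * w for _ in range(h)]
    v = [[0.0] * w for _ in range(h)]
    for _ in range(iterations):
        u_new = [[0.0] * w for _ in range(h)]
        v_new = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                ub, vb = local_mean(u, i, j), local_mean(v, i, j)
                # Residual of the brightness constraint, scaled by the weights.
                r = (ix[i][j] * ub + iy[i][j] * vb + it[i][j]) / (
                    alpha2 + ix[i][j] ** 2 + iy[i][j] ** 2)
                u_new[i][j] = ub - ix[i][j] * r
                v_new[i][j] = vb - iy[i][j] * r
        u, v = u_new, v_new
    return u, v


# Only column 2 has a brightness gradient; it is consistent with u = 0.5.
n = 5
ix = [[1.0 if j == 2 else 0.0 for j in range(n)] for i in range(n)]
iy = [[0.0] * n for _ in range(n)]
it = [[-0.5 if j == 2 else 0.0 for j in range(n)] for i in range(n)]

u, v = horn_schunck(ix, iy, it)
print([round(x, 3) for x in u[2]])  # flat columns have acquired nonzero velocity
```

Where all three gradients are zero the update reduces to the neighbourhood average, which is why velocities spread into flat regions, and equally why they spread past object boundaries when nothing stops them.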


There are three primary problems with this technique. The first
two involve the boundaries. First, the technique does poorly when
there are discontinuities in the velocity field or in the brightness
gradients, because of the smoothness assumption. The discontinuities
occur at object boundaries. Second, the same property that allows
velocities to propagate within an object tends to extend erroneous
velocities outside the area of an object. The problem is most
conspicuous for the case where an object is moving against a uniform
background. In this case it is not possible to distinguish the velocity of
the object from the velocity assigned to the uniform background.
Third, motion is constrained to be parallel to the image plane because
of the assumption that an object does not change the way it appears in
the image from frame to frame.

Boundary  Constraints 

The previous section suggests that discontinuities in velocity
which occur at object boundaries must be explicitly accounted for in
order to accurately determine velocities within the boundaries. We
propose to allow for these discontinuities by applying the smoothness
constraint separately to regions on either side of an image boundary.
This can be done once the projections of the object boundaries have
been located in the image. As we see next, implementation does not
require that the image be segmented into regions corresponding to
objects, rather only that the location of possible object boundaries be
determined.


This assumption severely limits the allowable motions. Rotations,
translations in depth and deformations often result in changes in the
image brightness pattern and violate this assumption. Horn and
Schunck made the additional assumption that neighboring points have
similar velocities. To implement this smoothness constraint, they


Image boundaries occur when one object moves in front of
another; these are called occluding boundaries. Image boundaries can
also occur due to the painted patterns or non-occluding edges on the
object; these are non-occluding boundaries. (There are also
boundaries due to object shadows, but these are not explicitly dealt
with here.) In terms of motions across them there is an important
difference between occluding and non-occluding boundaries. A
non-occluding boundary has consistent motion on both sides: there is no
velocity discontinuity. The regions on both sides of an occluding
boundary can have different velocities. We must process velocity flow
data at a boundary differently according to the type of boundary. The
smoothness constraint is enforced across non-occluding boundaries,
but not across occluding boundaries. This procedure permits spatial
discontinuities in flow velocity to occur when one object moves in
front of another.


The first factor is the error in satisfying equation (2), i.e.

$$\epsilon_I^2 = \left(I_x u + I_y v + I_t - \frac{dI}{dt}\right)^2 \qquad (9)$$

Since we allow dI/dt to be nonzero, it is included in $\epsilon_I^2$. The second
error factor is a measure of the departure from a spatially smooth
velocity field, $\epsilon_s^2$, which is the same as equation (4). The third error
factor is given by equation (8) and measures the departure from a
spatially smooth pattern change.


To apply this method, we need not predetermine whether a
boundary is occluding or non-occluding. First, the nearby velocities
are computed based on an assumption of non-occlusion; the
smoothness constraint is applied across the boundary. Next, the
velocities are recalculated assuming occlusion; the smoothness
constraint is not enforced across the boundary. Finally, the result that
best satisfies the equation for dI/dt (equation (2)) and the smoothness
constraint is retained. In this way, the boundary types can be locally
determined without explicit segmentation of the image into object
regions. This test is repeated with each iteration.
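
The two-hypothesis test can be sketched for a single pixel next to a boundary. This is a simplified one-dimensional illustration, not the procedure as implemented in the paper: dI/dt is taken as zero, the squared difference between a velocity and its neighbourhood mean is used as a discrete stand-in for the smoothness term, and the names `update`, `total_error`, and `classify_boundary` are hypothetical.

```python
def update(ix, it, u_bar, alpha2):
    """One Horn and Schunck style update in 1-D (dI/dt taken as zero)."""
    return u_bar - ix * (ix * u_bar + it) / (alpha2 + ix * ix)


def total_error(ix, it, u, u_bar, alpha2):
    e_i = ix * u + it          # residual of the brightness equation
    e_s2 = (u - u_bar) ** 2    # assumed discrete stand-in for the smoothness term
    return e_i ** 2 + alpha2 * e_s2


def classify_boundary(ix, it, same_side, far_side, alpha2=1.0):
    """Try both neighbourhoods and keep the estimate with the lower error.

    same_side: neighbouring velocities on this pixel's side of the boundary.
    far_side:  neighbouring velocities on the other side of the boundary.
    """
    spanning = (sum(same_side) + sum(far_side)) / (len(same_side) + len(far_side))
    one_sided = sum(same_side) / len(same_side)
    candidates = {}
    for label, u_bar in (("non-occluding", spanning), ("occluding", one_sided)):
        u = update(ix, it, u_bar, alpha2)
        candidates[label] = (total_error(ix, it, u, u_bar, alpha2), u)
    label = min(candidates, key=lambda k: candidates[k][0])
    return label, candidates[label][1]


# Object moving at u = 1 next to a stationary background.
print(classify_boundary(1.0, -1.0, same_side=[1.0], far_side=[0.0]))
# Painted edge: both sides move at u = 1.
print(classify_boundary(1.0, -1.0, same_side=[1.0], far_side=[1.0]))
```

In the first case averaging across the boundary drags the estimate toward the background and raises both error terms, so the one-sided estimate is kept. In the second case the two hypotheses agree and the boundary is treated as non-occluding.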


We minimize the sum of these error factors computed over the
image:

$$\text{minimize} \quad \sum_i \sum_j \left[\epsilon_I^2(i,j) + \alpha^2 \epsilon_s^2(i,j) + \beta^2 \epsilon_p^2(i,j)\right] \qquad (10)$$

An iterative form of the solution is found for the velocities at the
(k+1)-th iteration in terms of the spatial and temporal brightness
gradients and the neighboring velocities at the k-th iteration:


Pattern  Changes 

A pattern change refers to the change in image brightness of the
same physical point on an object from one frame to the next. A
pattern change will occur when points on the object are obscured or
revealed in successive image frames. This type of change causes
discontinuities in the velocity across occluding boundaries. These
changes have been accommodated by the method in a previous
section.

There is also another type of pattern change. For example,
when an object rotates and the lighting hits the object in a different
way, it results in different shading. For a Lambertian surface the
shading change of a given physical point on an object is given by:

$$\frac{dI}{dt} = k \frac{d(\cos i)}{dt} \qquad (7)$$

where i is the angle between the incident light and the surface
normal, and k is a constant. If the surface orientation is known, then
dI/dt gives a measure of the change in orientation.

Here we propose to allow for such pattern changes in the image
by constraining them to vary smoothly within boundaries. We can
think of the pattern change (dI/dt) as another velocity component.
While dI/dt is not strictly a velocity, we constrain dI/dt
to vary smoothly within object boundaries, just as was done for
the velocity components. Thus we can define a smoothness measure of
change in brightness variation:


where $\bar{u}^k$ and $\bar{v}^k$ denote averages of the neighboring velocities
at the k-th iteration and $\overline{(dI/dt)}^k$ denotes the average pattern change at
the k-th iteration. This iterative procedure is applied everywhere in
the image, but points in the neighborhood of a boundary are treated
differently. Boundaries are located by finding zero crossings in the
Laplacian of brightness [1] in each of a sequential pair of images and
forming a union of such zero crossings. The size of a neighborhood is
determined by the size of the region over which the smoothness
constraint $\epsilon_s^2$ is computed. Velocities are computed separately using
points in the neighborhood on one side of the boundary and again
using points in the neighborhood that span the boundary. This yields
two different estimates for the velocity. The estimate that minimizes
$\epsilon_I^2 + \alpha^2 \epsilon_s^2 + \beta^2 \epsilon_p^2$ is used.
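
The boundary-location step can be sketched as follows. This is a minimal illustration and makes assumptions the text does not state: it uses a plain five-point discrete Laplacian with no prior smoothing, marks a pixel when the Laplacian changes sign toward its right or lower neighbour, and ignores the one-pixel image border. The function names are hypothetical.

```python
def laplacian(img):
    """Five-point discrete Laplacian; the one-pixel border is left at zero."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = (img[i - 1][j] + img[i + 1][j] + img[i][j - 1]
                         + img[i][j + 1] - 4.0 * img[i][j])
    return out


def zero_crossings(img):
    """Pixels where the Laplacian changes sign toward the right or lower neighbour."""
    lap = laplacian(img)
    h, w = len(img), len(img[0])
    marks = set()
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni < h - 1 and nj < w - 1 and lap[i][j] * lap[ni][nj] < 0:
                    marks.add((i, j))
    return marks


def boundary_points(frame_a, frame_b):
    """Union of the zero crossings found in two successive frames."""
    return zero_crossings(frame_a) | zero_crossings(frame_b)


# A vertical step edge that moves one pixel to the right between frames.
frame_a = [[0.0 if j < 3 else 10.0 for j in range(6)] for _ in range(6)]
frame_b = [[0.0 if j < 4 else 10.0 for j in range(6)] for _ in range(6)]
print(sorted(boundary_points(frame_a, frame_b)))
```

Taking the union over both frames marks the edge position before and after the motion, so the band of pixels swept by the boundary is covered.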


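$$\epsilon_p^2 = \left[\frac{\partial}{\partial x}\left(\frac{dI}{dt}\right)\right]^2 + \left[\frac{\partial}{\partial y}\left(\frac{dI}{dt}\right)\right]^2 \qquad (8)$$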
New Algorithm

Now we can present our new algorithm which incorporates the
considerations on boundaries and pattern changes. To summarize,
this algorithm assumes: (a) the brightness changes of a single physical
point can be described by the first order expansion, equation (2); (b)
velocity changes in a neighborhood are similar, unless the
neighborhood contains an occluding boundary; (c) the rate of pattern
change (dI/dt) is also similar in a neighborhood. To impose these
assumptions we define an error factor for each.


Results 

Model of Expanding Ellipsoid

The algorithm described in this paper was tested with a sequence
of images generated by modelling an ellipsoid that expands uniformly
in all directions. The ellipsoid is assumed to have Lambertian surface
properties and to be illuminated with a distant source perpendicular to
the image plane. The image is resolved to 64 by 64 pixels and
quantized to 256 brightness levels (see Figure 1A). The maximum
velocity of any point in the image is approximately 0.5 pixels per
frame. The background is uniform and therefore provides no
information about motion. The actual velocity vectors for the
expanding ellipsoid are shown in Figure 1B. These are the results we
would like to obtain using our algorithm.
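
A test frame of this kind can be generated in a few lines. The sketch below is an assumption-laden reconstruction, not the authors' generator: the semi-axes `a`, `b`, `c`, the centring, and the expansion `rate` are hypothetical values, and brightness is taken as the cosine of the angle between the surface normal and a light source along the viewing axis.

```python
import math


def lambertian_ellipsoid(size=64, a=20.0, b=14.0, c=14.0, levels=256):
    """Render one frame: brightness = (levels - 1) * cos(i), background 0."""
    cx = cy = (size - 1) / 2.0
    img = [[0] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            x, y = col - cx, row - cy
            s = 1.0 - (x / a) ** 2 - (y / b) ** 2
            if s <= 0.0:
                continue  # outside the ellipsoid: uniform background
            z = c * math.sqrt(s)
            # Surface normal of x^2/a^2 + y^2/b^2 + z^2/c^2 = 1 (unnormalized).
            nx, ny, nz = x / a ** 2, y / b ** 2, z / c ** 2
            cos_i = nz / math.sqrt(nx * nx + ny * ny + nz * nz)
            img[row][col] = int(round((levels - 1) * cos_i))
    return img


def expansion_velocity(x, y, rate):
    """Image velocity of a point at (x, y) from the centre under uniform expansion."""
    return rate * x, rate * y


img = lambertian_ellipsoid()
# With a = 20 and rate = 0.025 the fastest point moves about 0.5 pixels per frame.
print(expansion_velocity(20.0, 0.0, 0.025))
```

Under uniform expansion the image velocity grows linearly with distance from the centre, which gives the radial pattern expected for the true flow field.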

Figure 2 shows the results of applying Horn and Schunck's
optical flow technique (equation 6) to the expanding ellipsoid. Here
the smoothness constraint is applied across object boundaries. While
the velocities are determined fairly accurately within the object, they
are propagated erroneously beyond the object boundaries. The total
error over the entire image is approximately 15%. When velocity
discontinuities are taken into account as outlined above, a more
accurate estimate of velocities is obtained as in Figure 3. We see that
use of boundary information results in a clear demarcation of
velocities within and without the object. The residual errors do not
extend substantially beyond the boundaries of the object. The total
error is 5%. However, the algorithm tends to overestimate the actual
velocities in the vicinity of the boundary. Such inaccuracies are
expected, because of the discontinuities in the brightness gradient that
occur at the border between one object and another. One way to
avoid this problem and possibly improve the flow velocity estimates
throughout a region is to determine velocities at the boundaries of the
region using another technique (see for example [2]). Such velocities
at the boundary provide initial conditions and remain fixed in the
iterative procedure. To see this effect, the actual velocities at the
boundary were supplied as initial conditions and remained fixed for
the iterative procedure. The result is shown in Figure 4. The total
error is less than 3%. For the case of the expanding ellipsoid, the
velocities inside the boundary region were close to the correct values
whether or not the initial boundary velocities were specified.
However, there are probably other cases when a good initial guess of
velocity at the boundary will substantially improve the velocity
estimates inside the bounded region.

Application  to  X-ray  Images 

Though these techniques have been developed for objects
imaged in visible light, we have begun to explore application of these
techniques to x-ray images. Our goal is to use them to analyze motion
of the heart from cine angiograms.

When optical flow techniques are applied to x-ray images, the
results have a different meaning. At each point in an x-ray image, the
brightness depends on the amount and density of the mass between
the x-ray source and the film. Because brightnesses depend on object
densities instead of reflectance, the velocities found by this method no
longer apply to single physical points on the surfaces of objects. For
simplicity, we assume that the density does not change in time and
that the brightness changes therefore represent depth changes. In
angiograms where radio-opaque dye is injected into the bloodstream,
the primary x-ray attenuators are the dye and calcified bone. For this
case, the assumption that pattern changes reflect changes in the depth
of the heart is accurate since the dye filled heart is the primary source
of motion.

X-ray images have two advantages over reflectance images of
opaque objects. First, depth information is available. Second, objects
are not totally occluded. The disadvantage is that a point in the image
does not generally correspond to a single point on a single object.
Thus the flow velocities take on a different meaning for x-ray images,
as described above. To understand what this difference means,
consider a reflectance image and an x-ray density image of the same
object, an expanding sphere. (See Figure 5.) In the reflectance image
the brightness due to a single physical point on the sphere is the same
in successive images, because it has the same surface orientation.
Therefore, to determine velocity at a given point, we need only find a
point of matching brightness that satisfies the global smoothness
constraint. If we look at brightness along one dimension of the image,
then the velocities are found by matching points of similar brightness
in successive frames. (See Figure 5A.)

In an x-ray density image which records the z height as the
brightness, such a simple matching of similar brightnesses frequently
does not yield sensible velocities. In fact, there may be many points in
one image for which there is no matching brightness in successive
images. As shown in Figure 5B, matching points in successive frames
of the x-ray image of a sphere based on similar brightness values
yields very large velocities near the densest part of the imaged sphere
(i.e. center) where the actual velocities are small. A meaningful
description of the motion from the density image would be obtained
by taking the rate of brightness change (dI/dt) into account. This can
be interpreted as a change in depth perpendicular to the image plane.
(See Figure 5C.) For x-ray images of a beating heart, the brightness at
a point in the image is dependent on the depth of the heart cavity
perpendicular to the image plane. Thus the pattern changes will
reflect the expansion or contraction movement of the heart in the
direction perpendicular to the image plane.




Figure 1: An image sequence was obtained by modelling an ellipsoid that expands uniformly
in all directions. One frame of the sequence is shown in (A). At each point in the image we
have computed the magnitude and direction of the local image velocity. The velocity vectors
at each point in the image are plotted here as short line segments representing magnitude and
direction. The correct velocity flow pattern determined from the model is shown in (B).
Figure 2: The velocity vectors for the expanding ellipsoid were calculated using Horn and
Schunck's optical flow algorithm. The resulting flow pattern is shown in (A). No boundary
constraints were imposed, so that the velocity smoothness constraint was applied across the
boundaries. The result is that the velocities propagated outside the boundary. A vector plot
of the velocity errors is shown in (B). Initial velocities were set to zero. The results are shown
after thirty-two iterations.




Figure 3: The velocity flow pattern in (A) was calculated for the expanding ellipsoid
assuming that flow discontinuities could occur at the boundaries. The boundaries used are
indicated by heavy black dots at the base of some of the velocity vectors. Velocity errors are
shown in (B). The velocities computed at the boundaries are substantially greater than the
actual velocities; however, the velocities inside the ellipsoid are very close to the actual
velocities. Again, the initial velocities were set to zero and the results are shown after
thirty-two iterations.


Figure 4: The velocity flow pattern in (A) was calculated for the expanding ellipsoid with the
velocities at the boundaries set to the actual values. As in Figure 3, discontinuities in velocity
flow are permitted at the boundary. The boundaries used are indicated by heavy black dots.
The plot in (B) shows the velocity errors. The total error is less than 3% after thirty-two
iterations.


REFLECTANCE  IMAGE 


Figure 5: When a sphere expands in a reflectance image, the profile of brightnesses in a
cross-section parallel to the x axis changes as shown in (A). We assume a distant light source
perpendicular to the image plane. The brightness of points on the surface does not change as the
sphere expands, so surface motion can be measured by matching points of similar brightness
that satisfy the smoothness constraint. When a sphere expands in an x-ray density image the
brightness of each surface point increases as shown in (B) and (C). It is no longer possible to
determine velocities by matching brightnesses. We expect the velocities to be the same as in
the reflectance case. However, if we simply match brightnesses as in (B), we obtain very large
velocities near the center of the imaged sphere where we expect very small velocities, as well as
a flow discontinuity at the center of the sphere. If, as in (C), we allow for smoothly
varying brightness changes (dI/dt) in addition to the motion in the plane parallel to the
image, then velocities in the image plane are as expected.


Figure 6: The expanding ellipsoid was modelled again, but the brightness at each point in
the image now corresponds to the depth of the ellipsoid measured perpendicular to the image
plane. The result is similar to an x-ray image of an ellipsoid. The velocities calculated from
the model are shown in (A) and the rate of pattern change (dI/dt) is shown in (B). The pattern
changes can be thought of as changes in depth.
ABSTRACT



We  present  a  procedure  for  processing  real  world 
image  sequences  produced  by  relative  translational 
motion  between  a  sensor  and  environmental  objects. 
In  this  procedure,  the  determination  of  the 
direction  of  sensor  translation  is  effectively 
combined  with  the  determination  of  the 
displacements  of  image  features  and  environmental 
depth.  It  requires  no  restrictions  on  the 
direction  of  motion,  nor  the  location  and  shape  of 
environmental  objects.  It  has  been  applied 
successfully  to  real-world  image  sequences  from 
several  different  task  domains. 

We then consider several extensions and
applications  for  such  things  as  independently 
moving  objects,  translational  blur  streaks,  other 
cases  of  restricted  motion,  computation  in  a 
hierarchical  structure,  and  incorporation  into 
hybrid  sensor  systems  for  autonomous  navigation. 


0.0  INTRODUCTION 


This  paper  presents  a  procedure  for  processing 
translational  motion  in  image  sequences.  The 
computation  robustly  combines  the  determination  of 
the  translational  motion  parameters,  image 
displacements,  and  environmental  depth.  It  can  be 
used  as  a  basic  component  for  visually  guided 
navigation since the other parameters of sensor
motion  can  either  be  obtained  using  other 
associated  devices  or  removed  by  sensor 

stabilization.  In  addition,  we  discuss  extensions 
for  such  things  as  independently  moving  objects, 
translational  blur  streaks,  and  computation  in 
hierarchical  structures. 


The basic procedure consists of two steps: Feature
Extraction and Search. The feature extraction
process  finds  small  image  areas  which  may 
correspond  to  distinguishing  parts  of  environmental 
objects.  The  direction  of  translational  motion  is 
then  found  by  a  search  which  determines  the  image 
displacement  paths  along  which  a  measure  of  feature 
mismatch  is  minimized  for  a  set  of  features.  The 
correct  direction  of  translation  will  minimize  this 
error  measure  and  also  determine  the  corresponding 
image  displacements  for  the  extracted  features. 

The  feature  extraction  process  finds  distinctive 
points  which  are  positioned  at  points  of  high 
curvature  along  contours  determined  by  simple 
processes  such  as  thresholding,  zero-crossing 
extraction  and  simple  local  contrast  measurements. 
Particular  forms  of  the  feature  extraction  process 
can  lead  to  effective  and  very  rapid  implementation 
on  proposed  image  processing  architectures. 

The  search  process  minimizes  an  error  measure 
defined  with  respect  to  a  unit  sphere  with  each 
point  on  the  sphere  corresponding  to  a  different 
direction  of  sensor  translation.  A  given  direction 
of  translation  constrains  the  motion  of  extracted 
image  features  to  straight  lines  which  radiate  from 
or converge onto a single point in the image plane.
The  error  measure  thus  associates  with  a  point  on 
the  unit  sphere,  corresponding  to  a  particular 
translational  axis,  a  number  describing  the  total 
extent  of  feature  mismatch  along  the  displacement 
paths  determined  by  the  translational  axis. 
Experiments  have  shown  this  error  measure  to  be 
smooth  and  with  a  distinct  minimum  in  a  large 
neighborhood about the correct translational axis.
Because of this, the search process can be quite
simple.
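
The geometric constraint behind the search can be sketched briefly. The example below assumes a pinhole camera with focal length `f` and a static scene, neither of which is specified in the text; the function names and the sample numbers are hypothetical. For a sensor translation (tx, ty, tz), every feature then moves along the line joining it to a single image point.

```python
def focus_of_expansion(t, f=1.0):
    """Image point that all displacement paths pass through, or None if tz = 0."""
    tx, ty, tz = t
    if tz == 0.0:
        return None  # purely lateral motion: the paths are parallel lines
    return (f * tx / tz, f * ty / tz)


def displacement(point, depth, t, f=1.0):
    """Image velocity of a static scene point at the given depth."""
    x, y = point
    tx, ty, tz = t
    return ((x * tz - f * tx) / depth, (y * tz - f * ty) / depth)


t = (0.2, 0.0, 1.0)            # candidate direction of sensor translation
foe = focus_of_expansion(t)
p = (0.5, 0.3)                 # an extracted image feature
d = displacement(p, 4.0, t)

# The displacement is parallel to the line from the focus to the feature.
cross = d[0] * (p[1] - foe[1]) - d[1] * (p[0] - foe[0])
print(foe, d, abs(cross) < 1e-12)
```

Depth changes only the length of the displacement, not its direction, so a candidate translation direction fixes the search path of each feature. An error measure of the kind described can then be evaluated per direction.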

Experiments  with  several  different  image  sequences 
indicate  that  the  procedure  is  very  robust  and 
applicable  to  a  wide  range  of  real  world 
situations.  We  also  review  particular  extensions 
for  implementing  the  procedure  in  a  hierarchical 
computational  framework,  dealing  with  independently 
translating objects, translational blur-streaks,
and  implications  for  autonomous  navigation. 


This  work  was  supported  by  grant  number 
N00014-82-K-0464 from DARPA.
Model of X-ray Image of Expanding Ellipsoid

We show an example for a sequence of images generated by
modelling an ellipsoid that expands with time. The brightness is
proportional to the depth of the ellipsoid perpendicular to the image
plane. The ellipsoid is expanding in all directions so that the size and
brightness change as a function of time. The velocities projected on
the image plane are the same as for the case of the reflectance image.
We expect the brightness changes to be proportional to the actual
brightness or depth. Figure 6 shows these anticipated velocities and
brightness changes. Figure 7 shows the velocity field which is
computed by the Horn and Schunck method (i.e., with the assumption
that there are no pattern changes, dI/dt = 0). As expected, we obtain a
large flow discontinuity at the center of the image of the ellipsoid.
Figure 8 shows the result of our method in which pattern changes are
allowed. The velocities and pattern changes are very close to the
expected results.

Experimental  Results  for  Heart  Images 

We have applied the methods described in this paper to x-ray
images of a dog's heart taken on film at 60 frames a second. Figure
9 shows an example of a single frame of the cine angiogram. A
radio-opaque dye was injected into the pulmonary artery just before the
image sequence was taken. The dye can be seen filling the left
ventricle, the aorta and some of the coronary arteries. The other
obvious structures in the images are a couple of catheters left over
from some previous injections. The film was digitized with 8 bits per
pixel and reduced to 100 x 100 pixels.

The velocities were computed using equations (II). Pattern
changes caused by the expansion and contraction of the heart
perpendicular to the image plane were permitted and discontinuities
in the velocity flow were accommodated at image boundaries as
described above. The image boundaries were located at the zero
crossings of the Laplacian of a smoothed version of a pair of sequential
images. The computed velocities are shown in Figure 10. To verify
these results, the computed motion description is used to predict the
brightness in a subsequent image from the brightness in the previous
image. A comparison of the predicted and actual images shows an
error of less than 0.5%. While this does not show that the motion
description is actually a good one, it does show that the algorithm is
working as expected. In order to get a subjective opinion of the
validity of the motion description, we have generated a movie of the
velocity vectors for an entire heart cycle and shown that it coincides
well with the apparent motion seen in the actual cine angiogram.
While the motion description obtained from the analysis of x-ray
images may be useful, it does not provide explicit information about
motion of object surfaces. This sort of information might be obtained
by using additional views of the object from different angles, or by
considering a priori information about the object's shape or symmetry.

Summary 

This paper extends the work of Horn and Schunck on optical
flow. Their velocity smoothness constraint is relaxed at boundaries to
permit discontinuities in estimated velocity where there are occluding
boundaries. It is not necessary to segment the image into objects in
order to use boundary information. Rather it is only necessary to
locate possible boundaries, which can be done by locating zero
crossings in the Laplacian of the smoothed image brightness. Images
of an expanding ellipsoid were used to test the resulting iterative
algorithm. The results showed that discontinuities in the velocity flow
could be accommodated, but that the velocity at the actual boundary
may be inaccurate. It is possible to estimate the velocities at the
boundary using another technique and then to use these estimates as
input to the iterative algorithm.


Though the techniques were originally developed for reflectance
images, we have begun to apply them to x-ray density images. To do
this it has been necessary to relax Horn and Schunck's restriction that
the appearance of an object not change from one image to the next.
This was done by assuming that the pattern changes in the image vary
gradually within object boundaries. Velocity flow patterns for x-ray
images of the expanding ellipsoid were obtained in this way. The
results showed that the computed velocities parallel to the image plane
were very close to those obtained for reflectance images and the
pattern changes were very nearly proportional to the velocities
perpendicular to the image plane. The technique was also used to
analyze x-ray cine angiograms of a beating heart and produced a
subjectively good description of the motion.

0.1 Coordinate System


The camera model referred to throughout this paper
consists of a planar retina embedded in a
three-dimensional Cartesian coordinate system
(X, Y, Z), with the origin at the focal point and the
optical  axis  aligned  with  the  positive  Z-axis 
(figure  1).  The  X  and  Y  axes  correspond  to  the 
gravitationally  intuitive  horizontal  and  vertical 
directions  respectively.  The  image  plane  is 
parallel  to  the  XY  plane  and  at  some  distance  along 
the  Z  axis.  Positions  in  the  image  plane  are 
described  using  a  2-d  coordinate  system  aligned 
with  the  X  and  Y  axes  of  the  camera  coordinate 
system  and  with  the  origin  determined  by  the 
intersection  of  the  image  plane  and  the  Z-axis. 


Figure  1 


0.2 Translational Motion Properties


It  is  useful  to  have  a  set  of  terms  for  describing 
the  motion  of  features  in  an  image  sequence  and  the 
corresponding  motion  of  environmental  points.  We 
define  an  Image  Displacement  Vector  to  be  a 
two-dimensional  vector  describing  the  displacement 
of  an  image  feature  from  one  image  to  the  next.  An 
Image  Displacement  Field  is  the  set  of  image 
displacement  vectors  for  successive  images.  An 
Image  Displacement  Sequence  indicates  the  positions 
of  an  image  feature  over  several  successive  images. 
Though we are dealing with discrete image
sequences, it is often possible to describe the
continuous curve along which an image feature point
is moving. This curve is called the Image
Displacement Path.


Corresponding to image motions we use a set of
terms for describing environmental motions. An
Environmental Displacement Field is the set of
three-dimensional vectors indicating the positions
of environmental points at successive instants. An
Environmental Displacement Sequence indicates the
position of an environmental point over several
successive instants. An Environmental Displacement
Path describes the three-dimensional curve that
environmental points are moving along for
particular motions.


For general camera motion, there are 5 parameters
[PRA81] that can be recovered from processing image
motion without knowing absolute camera displacement
or velocity (since absolute depth is lost): two
parameters for the unit vector (T1(t), T2(t)) which
describes the axis of translational motion at time
t; two parameters for the unit vector (R1(t),
R2(t))  describing  the  axis  of  rotation  at  time  t; 
and  one  parameter  R3(t)  which  describes  the  extent 
of  rotation  about  this  axis  at  time  t.  Both  of 
these  axes  are  positioned  at  the  origin  of  the 
camera  coordinate  system.  The  problem  of 
processing  image  motion  resulting  from  rigid  body 
camera  motion  can  be  organized  into  subcases  of 
increasing  complexity,  corresponding  to  the  number 
of  camera  motion  parameters  that  are  unconstrained. 


For  purely  translational  motion,  the  image 
displacement  paths  are  determined  by  the 
intersection  of  the  translational  axis  with  the 
image  plane.  If  the  translational  axis  intersects 
the  image  plane  on  the  positive  half  of  the  axis, 
the  point  of  intersection  is  called  a  Focus  of 
Expansion  (FOE)  and  the  image  motion  is  along 
straight  lines  radiating  from  it.  This  corresponds 
to  camera  motion  towards  environmental  points.  If 
the  translational  axis  intersects  the  image  plane 
on  the  negative  half  of  the  axis,  the  point  is 
called  a  Focus  of  Contraction  (FOC)  and  the  image 
displacement  paths  are  along  straight  lines 
converging  towards  it.  This  corresponds  to  camera 
motion away from environmental points. The
intersections of axes parallel to the image plane
are points at infinity and are treated as FOEs.
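
The FOE/FOC construction above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the `focal_length` parameter (the distance of the image plane along the Z axis) and the returned labels are hypothetical and do not come from the paper.

```python
import math

def focus_of_translation(axis, focal_length=1.0, eps=1e-12):
    """Intersect a (nonzero) translational axis with the image plane Z = focal_length.

    Returns (kind, point). kind is "FOE" when the positive half of the axis
    meets the plane, "FOC" when the negative half does, and
    "FOE_at_infinity" when the axis is parallel to the plane (the text
    treats that case as an FOE). For the parallel case, point is the unit
    in-plane direction rather than a finite position.
    """
    tx, ty, tz = axis
    if abs(tz) < eps:
        norm = math.hypot(tx, ty)
        return "FOE_at_infinity", (tx / norm, ty / norm)
    point = (focal_length * tx / tz, focal_length * ty / tz)
    return ("FOE" if tz > 0 else "FOC"), point
```

For example, `focus_of_translation((1.0, 0.0, 1.0))` gives an FOE at (1.0, 0.0), while reversing the axis gives an FOC at the same image position.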


Given the direction of translation and image
displacements, the relative depths of points can be
computed by solving the inverse perspective
transform. Relative depth can also be inferred
from the position of a feature and the extent of
its displacement relative to an FOE or an FOC.
This relation is expressed as

$$\frac{D}{\Delta D} = \frac{Z}{\Delta Z} \tag{1}$$

where Z is the value of the Z component of an
environmental point at time t+1, del Z (ΔZ in
equation 1) is the extent of environmental
displacement along the Z axis from time t to time
t+1, D is the distance of the corresponding image
point from the FOE or FOC at time t and del D (ΔD
in equation 1) is the image point's displacement
from time t to time t+1. Thus, the Z value of an
environmental point can be recovered from image
measurements in units of del Z, or what has been
termed Time-Until-Contact by Lee [LEE80].
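
As a quick numerical check of equation 1, the sketch below projects a synthetic point before and after a pure camera translation and recovers its depth in units of del Z. The pinhole projection with unit image-plane distance, the helper names and the test values are assumptions made for illustration only.

```python
import math

def project(point, focal_length=1.0):
    """Perspective projection of a camera-frame point onto the image plane."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)

def time_until_contact(foe, image_pt_t, image_pt_t1):
    """Depth at time t+1 in units of del Z, i.e. D / del D from equation 1."""
    d = math.hypot(image_pt_t[0] - foe[0], image_pt_t[1] - foe[1])
    d_next = math.hypot(image_pt_t1[0] - foe[0], image_pt_t1[1] - foe[1])
    return d / (d_next - d)

# Synthetic scene: the camera translates by T, so a static point moves by -T
# in the camera frame.
T = (0.5, 0.0, 1.0)
point_t = (2.0, 1.0, 10.0)
point_t1 = tuple(p - t for p, t in zip(point_t, T))
foe = project(T)
depth_in_del_z = time_until_contact(foe, project(point_t), project(point_t1))
```

Here the point ends at Z = 9 after the camera advances by del Z = 1, and `depth_in_del_z` comes out as 9 up to rounding, as equation 1 predicts.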

The set of all possible translational axes
describes a unit sphere called the translational
direction sphere. The procedures below are defined
with respect to this sphere, rather than the image
plane itself, for reasons described below.

1.0 EXTRACTION OF INTERESTING POINTS


The  feature  extraction  process  is  used  to  determine 
small  areas  (sometimes  called  image  points  or 
features)  in  an  image  that  are  distinct  from 
neighboring  areas.  This  distinctiveness  limits  the 
potential  matches  of  these  image  areas,  and 
possibly  reflects  a  correspondence  to  actual  and 
significant  points  in  the  environment,  such  as 
points of high curvature on object boundaries,
texture elements, surface markings, etc. (However,
some features, termed false features, will result
from noise, occlusion, and light source effects and
have behavior which is currently difficult to
interpret.) Features can be represented either as
arrays  of  numbers  extracted  directly  from  an  image, 
or  as  parameterized  tokens  describing  local  image 
properties.  In  this  paper,  we  refer  to  features 
exclusively  as  small  arrays  of  data  values  centered 
at  some  point  in  an  image  at  some  time  t. 

Following Moravec [MOR77, MOR80], the method of
feature extraction used here is based upon finding
image areas which are significantly different from
their neighboring areas. Using a correlation
measure  bounded  between  1  (for  perfect  correlation) 
and  0,  the  distinctiveness  of  a  feature  is  1  minus 
the  best  correlation  value  obtained  when  the 
feature  is  correlated  with  its  immediately 
neighboring  areas.  Good  features  can  then  be 
selected  by  finding  the  local  maxima  in  the  values 
of  the  distinctiveness  measure  over  an  image. 
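
A minimal sketch of the distinctiveness measure follows, assuming the image is a list of rows of non-negative numbers, that "immediately neighboring areas" means the eight windows shifted by one pixel, and that the Moravec correlation defined later in the paper is the correlation used. The function names are hypothetical, and the caller is assumed to keep the window away from the image border.

```python
def window(image, row, col, half):
    """Square patch of side 2*half+1 centered at (row, col)."""
    return [r[col - half: col + half + 1]
            for r in image[row - half: row + half + 1]]

def moravec_correlation(a, b):
    """Cross term divided by the mean of the two patch energies."""
    num = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    den = (sum(x * x for r in a for x in r)
           + sum(y * y for r in b for y in r)) / 2.0
    return num / den if den > 0 else 1.0  # two empty patches match perfectly

def distinctiveness(image, row, col, half=2):
    """1 minus the best correlation with the eight one-pixel-shifted windows."""
    centre = window(image, row, col, half)
    best = max(
        moravec_correlation(centre, window(image, row + dr, col + dc, half))
        for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0))
    return 1.0 - best
```

A uniform region scores 0 (every neighbor correlates perfectly), while an isolated bright pixel on a dark background scores 1. Candidate features would then be the local maxima of this score.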

We  have  extended  this  approach  somewhat  by 
constraining  the  neighborhoods  over  which  the 
features  are  selected  to  contours  determined  by 
other global processes, such as zero-crossing
extraction  and  thresholding,  which  are  sensitive  to 
edges. 


1.1 Feature Extraction Using Zero-Crossings


The use of zero-crossings to determine significant
image contours at different levels of resolution
has been proposed and extensively studied by Marr
et al. [HIL80, MAR80]. In this processing an
image is convolved with Gaussian-Laplacian masks
((del)**2g) of different positive widths and
thresholded at zero to determine zero-crossing
contours. These contours are significant since
they correspond to the points of greatest change in
the convolved image. The distinctiveness measure
can be applied to points along these contours in
the convolved image, with the local maxima
determining the position of potential features.
This generally has the effect of finding points of
high curvature along the zero-crossing contour,
although points apparently corresponding to local
occlusion vertices and weak maxima will also be
extracted.


Many of the weak features which are local maxima of
distinctiveness can be removed by suppressing those
which are at points of low curvature along the
zero-crossing contours. For features which are
local distinctiveness maxima, we approximate the
curvature along the contour by the inner product of
the normalized vectors describing the relative
positions of adjacent local maxima along the
contour (figure 2). These values are then
thresholded between 1.0 (corresponding to high
curvature) and -1.0 (corresponding to low
curvature).
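
The inner-product curvature approximation can be sketched as follows. The sketch assumes the distinctiveness maxima are given in order along an open contour as (x, y) pairs and that only interior points are scored; the function names are hypothetical.

```python
import math

def curvature_score(prev_pt, pt, next_pt):
    """Inner product of the unit vectors from pt to its two neighbors.

    +1.0 means both neighbors lie in the same direction (a sharp spike),
    -1.0 means they lie in opposite directions (a straight contour).
    """
    ax, ay = prev_pt[0] - pt[0], prev_pt[1] - pt[1]
    bx, by = next_pt[0] - pt[0], next_pt[1] - pt[1]
    return (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))

def suppress_low_curvature(points, threshold=-0.8):
    """Keep interior points whose score exceeds the threshold."""
    return [pt for prev_pt, pt, next_pt in zip(points, points[1:], points[2:])
            if curvature_score(prev_pt, pt, next_pt) > threshold]

# Angle between the two neighbor vectors at which the threshold is crossed.
threshold_angle = math.degrees(math.acos(-0.8))
```

With the default threshold of -0.8, a point is kept when the angle between its neighbor vectors is below `threshold_angle`, roughly 143.13 degrees.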


Figure  2 

Use  of  zero-crossing  based  features  requires 
specification  of  the  sizes  of  the  convolution  masks 
that  are  employed,  and  deciding  whether  to  position 
extracted  feature  points  with  respect  to  the 
unprocessed  image  or  the  convolved  images.  It  is 
usually  beneficial  to  use  masks  of  various  widths 
for  sensitivity  to  features  at  different  levels  of 
resolution. The processing described below can be
applied  independently  to  the  pairs  of  successive 
images  formed  by  convolving  the  successive  images 
with  two  such  masks.  Alternatively,  features  can 
be  extracted  from  the  original,  unfiltered  image  at 
the  positions  where  features  were  determined  in  the 
convolved  images,  though  experience  with  large 
masks  has  shown  that  this  approach  can  position 
features  significant  distances  from  their  apparent 
position  in  the  original  image. 

The images in figure 3a and figure 3b were taken
from a gyroscopically stabilized movie camera held
by a passenger in a car travelling down a country
road in Massachusetts. They are 128x123 pixel
images with 6 bits of resolution in intensity and
will be referred to as the roadsign images. Figure
3c shows the zero-crossings extracted from the
initial roadsign image using a (del)**2g mask with
a width of 5 pixels. The distinctiveness values
were computed using features which were 5x5 pixel
arrays extracted from the convolved image and
centered on pixels which were adjacent to the
zero-crossing contour and of positive value. These
features were correlated, using Moravec's norm (see
below), with their 8 immediately neighboring
features. The distinctiveness measure for a
feature was set to 1 minus the best correlation
obtained in its neighborhood, excluding itself.
Figure 3d shows the local maxima in the
distinctiveness measure positioned with respect to
the zero-crossing contour. Figure 3e shows the
results of suppressing low-curvature points using a
threshold set to -0.8 (corresponding to an angle of
143.13 degrees).

Other types of contour extraction can be used
besides zero-crossings and total image thresholds.
A simple one is to constrain the extraction of
interesting points to positions where image
contrast exceeds some minimal value. Another is to
use contours determined by local application of
histogram guided thresholding and segmentation.
This resolves many of the problems associated with
using a single threshold determined for image
subparts with significantly different brightnesses.


Developing the error measure requires a measure for
the degree of match between features and an
interpolation process for determining positions
along an image displacement path. Each of these
can be implemented in various ways with the choices
generally involving a trade-off between the speed
of evaluating the error measure and the precision
with which the translational axis can be
determined.


2.1 Match Metric


There are several metrics for similarity of nxn
pixel features of the form A(i,j) and B(i,j), where
i ranges from 1 to n and j ranges from 1 to n. We
have utilized:


Normalized Correlation

$$\frac{\sum_i \sum_j A(i,j)\,B(i,j)}{\sqrt{\sum_i \sum_j A(i,j)\,A(i,j)}\;\sqrt{\sum_i \sum_j B(i,j)\,B(i,j)}} \tag{2}$$


Moravec Correlation [MOR77]

$$\frac{\sum_i \sum_j A(i,j)\,B(i,j)}{\left(\sum_i \sum_j A(i,j)\,A(i,j) + \sum_i \sum_j B(i,j)\,B(i,j)\right)/2} \tag{3}$$


Normalized Absolute Value Difference

$$1 - \frac{\sum_i \sum_j \left|A(i,j) - B(i,j)\right|}{\sum_i \sum_j A(i,j) + \sum_i \sum_j B(i,j)} \tag{4}$$


All of these measures have a value of 1 for a
perfect match. Of these, the first choice is the
most conventional, the second a good approximation
to the first, and the third is the quickest to
evaluate.


2.2 Interpolation Process


The interpolation process approximates the
potential displacements of a feature from an
initial image into a succeeding image. Depending
on the accuracy required, positions along the image
displacement path can be approximated roughly by
setting the coordinates of the feature's position
to the nearest integer value, or more accurately by
performing a subpixel interpolation of the feature
at each of a set of selected positions along the
image displacement path. The basic trade-off is
between speed and accuracy, with subpixel
interpolation being a more expensive computation.


2.3 Error Measure


The error measure associates with a point on the
direction of translation sphere a number describing
the quality of feature matches along the image
displacement paths determined by the corresponding
translational axis. This error value is computed
by first finding the best match for each feature
along a segment of the image displacement path
determined by the translational axis using one of
the normalized match metrics above. Each of these
values is then subtracted from one, and all the
resulting values are added together to form an
error measure. Thus, for a set of N features in an
initial image, a hypothesized translational axis,
and use of one of the match metrics above, the
error measure is

$$\sum_{i=1}^{N} \left(1.0 - \mathrm{bestmatch}(i)\right) \tag{5}$$

where bestmatch(i) is the best match value
associated with feature i along the image
displacement path determined for it by the
hypothesized translational axis.
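
The three match metrics and the error measure can be sketched directly from their definitions. The sketch assumes features are equal-sized lists of rows of non-negative numbers, reads the absolute value difference metric as one minus the normalized difference (so that a perfect match scores 1, as the text requires), and handles zero denominators with conventions that are not specified in the paper. All function names are hypothetical.

```python
import math

def _pairs(a, b):
    """Corresponding pixel pairs of two equal-sized patches."""
    return [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]

def normalized_correlation(a, b):
    p = _pairs(a, b)
    den = (math.sqrt(sum(x * x for x, _ in p))
           * math.sqrt(sum(y * y for _, y in p)))
    return sum(x * y for x, y in p) / den if den > 0 else 0.0

def moravec_correlation(a, b):
    p = _pairs(a, b)
    den = (sum(x * x for x, _ in p) + sum(y * y for _, y in p)) / 2.0
    return sum(x * y for x, y in p) / den if den > 0 else 1.0

def absolute_difference_match(a, b):
    p = _pairs(a, b)
    den = sum(x + y for x, y in p)
    return 1.0 - sum(abs(x - y) for x, y in p) / den if den > 0 else 1.0

def error_measure(best_matches):
    """Sum over features of (1.0 - best match value)."""
    return sum(1.0 - m for m in best_matches)
```

Note that the Moravec form penalizes a pure brightness scaling (a patch against twice itself scores 0.8) whereas normalized correlation does not, which is one way the two differ despite being close for similar patches.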


The  error  measure  was  computed  in  two  forms  in  the 
experiments  below:  a  fast  evaluation  form  and  a 
precise  evaluation  form.  The  fast  form  uses  the 
absolute  value  norm  and  the  nearest  integer 
approximation  to  determine  feature  position  along 
the  image  displacement  paths.  The  fast  form  is 
useful  for  evaluating  image  sequences  with  several 
extracted  features  to  determine  the  rough  position 
of  the  global  minimum.  However,  the  fast  form  is 
not  adequate  for  fine  determination  of  the 
translational  axis  because  it  does  not  vary 
smoothly  with  respect  to  small  changes  in  the 
position  of  a  translational  axis,  due  to  the 
nearest  integer  approximation  for  feature  position. 


The  precise  form  of  evaluation  uses  the  Moravec 
norm  and  bi-linear  interpolation.  It  has  been 
found  to  vary  smoothly  with  respect  to  small 
changes  in  the  position  of  a  translational  axis. 
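
The difference between the two position approximations can be illustrated with a small sketch. It assumes the image is a list of rows (at least 2x2), with x indexing columns and y indexing rows, and clamps the interpolation cell to the image bounds; these conventions and the function names are assumptions, not taken from the paper.

```python
import math

def nearest(image, x, y):
    """Fast form: sample at the nearest integer pixel position."""
    return image[int(math.floor(y + 0.5))][int(math.floor(x + 0.5))]

def bilinear(image, x, y):
    """Precise form: bi-linear interpolation at a subpixel position."""
    height, width = len(image), len(image[0])
    x0 = min(max(int(math.floor(x)), 0), width - 2)
    y0 = min(max(int(math.floor(y)), 0), height - 2)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x0 + 1] * fx
    bottom = image[y0 + 1][x0] * (1 - fx) + image[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bottom * fy
```

Because `nearest` jumps between pixels as the sampled position moves, an error measure built on it changes in steps, which is consistent with the observation that only the interpolated form varies smoothly with the translational axis.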


2.4  Search  Organization 


The search process used here consists of two
phases. A global sampling of the error measure
determines the rough shape of the error surface,
then a local search determines the minimum. The
local search begins at the position where the
minimum value was determined by the global
sampling. The procedure used for the local search
is steepest descent with a diminishing step-size.
That is, the steepest descent procedure begins with
an initial fixed step size and determines a local
minimum using it. The step-size is then reduced
and the procedure repeated until the step-size is
at the desired resolution for the determination of
the translational axis. In the experiments below
the initial step-size was set to 0.1 and then
reduced successively to 0.025 and 0.005 radians.
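
A minimal sketch of a local search with a diminishing step-size is given below. It is not the authors' implementation: it assumes the sphere is parameterized by a polar angle and an azimuth, and it replaces gradient-based steepest descent with a discrete variant that moves to the best of the eight neighboring grid positions until none improves. The `error` callable and all names are hypothetical.

```python
import math

def axis_from_angles(theta, phi):
    """Unit vector for polar angle theta (from +Z) and azimuth phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def local_search(error, theta, phi,
                 step_sizes=(0.1, 0.025, 0.005), max_moves=1000):
    """Descend on error(axis) with successively smaller angular steps."""
    best = error(axis_from_angles(theta, phi))
    for step in step_sizes:
        for _ in range(max_moves):
            neighbours = [(theta + dt * step, phi + dp * step)
                          for dt in (-1, 0, 1) for dp in (-1, 0, 1)
                          if (dt, dp) != (0, 0)]
            value, t, p = min((error(axis_from_angles(t, p)), t, p)
                              for t, p in neighbours)
            if value >= best:   # local minimum at this step-size
                break
            best, theta, phi = value, t, p
    return axis_from_angles(theta, phi), best
```

In use, `theta` and `phi` would be initialized from the best axis found by the global sampling, and `error` would evaluate the precise form of the error measure for a hypothesized axis.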


In  general,  the  error  measure  has  been  found  to  be 
smooth,  with  a  single  minimum  in  a  large 
neighborhood  around  the  correct  translational  axis. 
Thus, the global sampling can be quite sparse or
the initial step size of the local search quite
large.


3.0  EXPERIMENTS 


This procedure has been applied to several
different image sequences under various conditions.
These have included adding spurious and weak
features; using features that were sparse and
scattered across the initial image or confined to a
limited image area; and cases in which the
translational axis did or did not intersect the
image plane in a visible portion of the image.
These experiments have shown that the procedure is
robust in several important ways. It is resilient
with respect to weak and false features. It can
use a small number of features positioned across an
image surface or a small number of features from a
limited area of the image. And it is not affected
by the orientation of the translational axis.


Results of processing the roadsign images are shown
in figures [illegible] and 6. The global search was
used with the absolute value norm and nearest
integer interpolation. The sampling increment
corresponded to the vectors on the direction of
motion sphere being separated by .314157 radian
increments. The maximal image displacement was set
to 10 pixels. Using features centered at the
positions shown in figure 3e, the global sampling
determined a minimum in the error function at the
unit vector (-.80902, -.47554, .34548) on the
direction of translation sphere.


The local search then used the Moravec norm and
bi-linear interpolation. The determined
translational axis was (-.83738, -.42043, .34933).
The displacements of the feature points from figure
3d for this translational axis are shown in figure
6.


The procedure was repeated, but using features at
the positions from figure 3d (those prior to low
curvature suppression). This has the effect of
introducing weak and false features into the
computation. The translational axis extracted was
(-.82909, -.42281, .36585). This is a difference of
0.01863 radians or 1.06765 degrees from that
determined using the features indicated in figure
3e. Since the camera focal length was longer than
1, the angle between the determined translational
axes [...] considerably less than this.


The procedure was also applied using the features
from the restricted subarea shown in figure 7,
corresponding to some faint tree texture. Using
these features, the translational axis extracted
was (-.84281, -.42928, .32465). This is a
difference of 0.02677 radians or 1.53418 degrees
from the translational axis determined using the
features centered at the positions indicated in
figure 3e.
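
The angular differences quoted in these experiments can be checked from the reported axes with a few lines. The function name is hypothetical, and the axes are normalized before use because they are quoted to only five decimal places.

```python
import math

def angle_between(a, b):
    """Angle in radians between two 3-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))

axis_3e = (-0.83738, -0.42043, 0.34933)       # features from figure 3e
axis_3d = (-0.82909, -0.42281, 0.36585)       # features from figure 3d
axis_subarea = (-0.84281, -0.42928, 0.32465)  # features from the subarea

diff_3d = angle_between(axis_3e, axis_3d)
diff_subarea = angle_between(axis_3e, axis_subarea)
```

These evaluate to about 0.0186 and 0.0268 radians respectively, in agreement with the reported differences to within the rounding of the printed axes.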


Given  the  direction  of  translation  and  image 
displacements,  the  relative  environmental  depths  of 
image  points  can  be  recovered  by  the  simple 

relation  in  equation  1.  When  image  displacements 
are  small,  the  inferred  depth  values  can  be  quite 
erratic  due  to  sensitivity  to  small  numbers  in  the 
denominator  in  the  left  hand  side  of  equation  1. 

For  this  reason,  it  is  necessary  to  keep  track  of 
the  image  displacements  over  several  successive 
images  with  concurrent  updating  of  the  inferred 
depth  values.  This  was  done  using  a  sequence  of 

four  successive  images  from  the  roadsign  sequence 
beginning  with  roadsign  images  1  and  2  and  using 
the  features  from  image  1  at  the  positions  in 

figure 3e. The position of the translational axis
determined from images I(t) and I(t+1) was used as
the initial value in the local search for
determining the translational axis for images
I(t+1) and I(t+2).


The displacements of all features along the contour
in figure [illegible] were evaluated using the image
displacement paths determined by the translational
axis found for images I(1) and I(4). From these
displacements, the depths of the image points
along the contour were computed using equation 1.


The roadsign sequence is particularly nice for
presenting depth processing results because the
three environmental objects in the images are at
three distinct depths. This is shown in figure 8a
by the three distinct clusters in the histogram of
the depth values calculated for the points along
the contour. The units in the histogram are
cumulative time-until-contact values. That is,
the depth is given in units of the displacement of
the camera from I(1) to I(4) along the Z-axis.


Figure 8c


5.0 DISCUSSION


We  now  discuss  particular  aspects  of  the  procedure 
and  then  consider  several  extensions  and 
applications. 


5.1  Feature  Extraction 


Since the procedure's performance does not degrade
severely due to the occurrence of poor features,
the type of feature extraction used is not
critical. Nonetheless, the feature extraction
process used here could be extended in many ways.
The low-curvature suppression, if it is used, could
take into account boundary length along a contour
between distinctiveness maxima to determine whether
to suppress or generate a feature for further
processing. It is also possible to determine
points of high curvature along the boundary
without having to walk along the contour by the
modifications discussed above in section 1.2 or by
using other operators which can directly measure
curvature [KIT80].


Another  useful  extension  would  be  to  use 
information  determined  from  the  extraction  of  the 
translational  axis  to  isolate  false  features.  This 
could  involve  removing  those  features  which  have 
weak  matches  from  the  error  measure  calculation 
once  a  translational  axis  has  been  determined  and 
re-evaluating.  Alternatively,  the  depth  inferences 
could  be  used  to  isolate  the  positions  of  potential 
false  features  by  noting  discontinuities  in  depth 
along  an  extracted  contour.  Extracted  features 
could  be  removed  from  the  re-evaluation  of  the 
error  measure  if  they  are  at  or  near  such 
positions. Another type of feature which can
affect the evaluation of the error measure is one
near an FOE or FOC which is contained in a
visible portion of the image. Such features tend
to move very small amounts along their image
displacement paths and hence require fine
interpolation to determine their best matches.


Figure 8d


5.2 Properties of the Error Measure


In the experiments, the error measure has a
distinct global minimum at the point on the unit
sphere corresponding to the correct translational
axis. It is expected to have such behavior
generally because it is very unlikely that
translational axes that are far from the correct
position will define image displacement paths that
simultaneously allow good matches for many
features. Thus competing candidates for the global
minimum should not be widely separated.

The error measure is affected by both
non-distinctive and false features.
Non-distinctive features will match well for many
different translational axes. Large numbers of
these weak features will flatten the response of
the error measure. False features will also
distort the error measure since they will often
have their best matches with incorrect
translational axes.

The effects of these poor features should be
compensated for by the agreement of good features.
Every one of the good features will tend to have a
bad match for an incorrect translational axis and
their unanimity is expected to override the lack of
discrimination of weak features and the random
quality of the matches of false features.


5.3 Utility of the Direction of Translation Sphere


There  are  significant  advantages  in  defining  the 
error  measure  with  respect  to  a  unit  sphere, 
instead  of  the  potential  positions  of  FOEs  and  FOCs 
in  the  image  plane.  The  sphere  is  a  bounded 
surface  which  makes  uniform  global  sampling  of  the 
error  measure  feasible.  Additionally,  the 
resolution  in  the  position  of  the  translational 
axis  varies  across  the  surface  of  the  image  plane. 
For  example,  the  FOEs  determined  by  translational 
axes separated by very small angles will be
separated by larger and larger distances in the
plane as the intersections of the translational
axes and the image plane are placed further from
the  visible  image.  The  effect  of  this  on  the  error 
measure,  when  it  is  defined  over  the  image  plane, 
is  large  flat  areas  for  FOEs  further  from  the 
visible  portions  of  the  image.  Finally,  special 
criteria  must  be  used  to  distinguish  between  FOEs 
and  FOCs  if  the  error  measure  is  defined  relative 
to  the  image  plane.  Roughly  parallel  image 
displacements  could  correspond  to  an  FOE  off  to  one 
side  of  the  image  plane  or  to  an  FOC  off  to  the 
opposite side. On the direction of translation
sphere, the corresponding translational axes would
be close, while on the plane they are completely
separated.
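
The loss of resolution in the image plane can be illustrated numerically. The following is a minimal sketch, assuming a pinhole camera with an illustrative focal length f = 1.0 and a translational axis at angle theta from the optical axis, so that the FOE lies at distance f*tan(theta) from the image center. The function names and the focal length are assumptions made for illustration and do not come from the procedure described here.

```python
import math

def foe_distance(theta, f=1.0):
    """Distance of the FOE from the image center for a translational
    axis at angle theta (radians) from the optical axis.
    Assumes a pinhole model with focal length f (illustrative value)."""
    return f * math.tan(theta)

def plane_step(theta, dtheta, f=1.0):
    """Image-plane separation of two FOEs whose axes differ by dtheta."""
    return foe_distance(theta + dtheta, f) - foe_distance(theta, f)

# Equal one-degree steps on the sphere map to growing steps in the plane.
one_degree = math.radians(1.0)
for deg in (0, 30, 60, 80, 88):
    print(deg, plane_step(math.radians(deg), one_degree))
```

The printed separations grow rapidly as the axis approaches 90 degrees from the optical axis, which is the source of the large flat areas in an error measure defined over the image plane.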


5.4 Optimization Procedure


The  optimization  procedure  used  here  is  very 
simple,  and,  because  of  the  strong  unimodality  of 
the  error  measure  and  its  smoothness,  other 
techniques  with  more  rapid  convergence  could  be 
used.  It  is  interesting  to  note,  however,  that  the 
global  component  of  the  optimization  performed  here 
is  an  instance  of  a  generalized  Hough  Transform 
[BAL81, O'RO81] in which each feature scales its
vote for a particular translational axis by the
best match it can find consistent with the
translational axis.
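
The voting interpretation can be sketched as follows. This is a toy illustration, not the implementation used in the experiments: it assumes that match strengths (higher is better) have already been computed for each feature at a few sampled positions along the displacement path of each candidate axis. The data layout and the names `hough_votes` and `best_axis` are hypothetical.

```python
def hough_votes(match_scores):
    """match_scores[a][k] is a list of match strengths for feature k at
    sampled positions along the displacement path defined by candidate
    axis a. Returns one accumulated vote per candidate axis."""
    votes = []
    for per_feature in match_scores:
        # Each feature votes with the best match it finds along its path.
        votes.append(sum(max(path) for path in per_feature))
    return votes

def best_axis(match_scores):
    """Index of the candidate axis with the largest accumulated vote."""
    votes = hough_votes(match_scores)
    return max(range(len(votes)), key=votes.__getitem__)

# Toy data: 3 candidate axes, 2 features, 3 sampled displacements each.
scores = [
    [[0.2, 0.3, 0.1], [0.4, 0.2, 0.1]],
    [[0.9, 0.5, 0.2], [0.1, 0.8, 0.3]],
    [[0.3, 0.3, 0.3], [0.5, 0.1, 0.2]],
]
print(best_axis(scores))
```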


6.0  EXTENSIONS  AND  APPLICATIONS 


6.1 Hierarchical Computation


A  basic  paradigm  in  computer  vision  is  the  use  of 
hierarchical  representations  and  processes.  This 
allows  for  different  magnitudes  and  scales  of  image 
events  to  be  handled  uniformly.  The  consistent 
agreement  among  hierarchically  organized  processes 
is  a  basic  control  strategy  for  interpretation 
processes.  Additionally,  hierarchical  processing 
can  produce  significant  speed-ups  wherein  results 
from  processing  done  rapidly  at  lower  resolutions 
of  image  information  are  used  to  direct  and 
constrain  more  detailed  and  extensive  processing  of 
higher resolution image information.

The translational motion procedure can be developed
in a hierarchical fashion with the primary benefits
being increased speed and the ability to deal with
large image displacements. This development
requires specifying the hierarchical
representations of the successive images and the
extracted features and how processing at different
levels of image resolution is related.

In the initial work described here, images have
been represented in the VISIONS image operating
cone structure [HAN81]. This consists of a
sequence of images I0, I1, I2, ..., In where the
successive sizes of the images are
1 x 1, 2 x 2, 4 x 4, ..., 2**n x 2**n. Each pixel in the
i-th image, except for the first and last images,
has a connected neighborhood of immediate
descendents in the i+1 image and a unique parent in
the i-1 image. (One point of confusion: we speak
of going up and down the cone and of images having
higher and lower resolution. Unfortunately, as we
go higher up the cone, image resolution gets lower;
and as we go down the cone, image resolution gets
higher.) The size and shape of the immediate
descendent neighborhood can be arbitrary. The
immediate descendent neighborhoods of adjacent
parent pixels may or may not overlap.


There are several ways to reduce the resolution of
an image and project it up the VISIONS cone [HAN81,
BUR82]. These techniques involve smoothing the
image with some operator and then sampling at a
reduced interval. This can be expressed
computationally by defining the value of a parent
pixel to be some average of the pixels in its
immediate descendent neighborhood. The results of
reducing image resolution by Gaussian smoothing for
the roadsign images are shown in figures 9a-e.
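
A minimal sketch of one such reduction follows. It assumes non-overlapping 2x2 descendent neighborhoods and a plain mean as the averaging operator, which is a simplification of the Gaussian smoothing used for the figures; the function names are hypothetical.

```python
def reduce_level(img):
    """Parent pixel = mean of its 2x2 non-overlapping descendent block.
    img is a square list of lists whose side is a power of two (>= 2)."""
    n = len(img) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1] +
              img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(n)] for r in range(n)]

def build_cone(img):
    """Return levels [I0, I1, ..., In], with I0 the 1x1 apex and
    In the full-resolution input image."""
    levels = [img]
    while len(levels[0]) > 1:
        levels.insert(0, reduce_level(levels[0]))
    return levels

# A 4x4 test image with values 0..15 yields a three-level cone.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
cone = build_cone(image)
print([len(level) for level in cone])
```

Overlapping neighborhoods or a weighted (for example Gaussian) average would change only `reduce_level`; the indexing of levels from the apex downward matches the I0, ..., In convention of the cone.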

Extracted  features  can  also  be  represented  in  the 
cone  structure  at  different  levels  of  resolution. 
One  approach  is  to  apply  the  feature  extraction 
process  described  above  for  each  image  at  each 
resulting  level  of  resolution  in  the  cone.  Another 
technique is to extract features in the
highest resolution image and project these
extracted feature positions up the cone; thus a
feature is positioned at a parent pixel if any of
its descendents are at positions where a feature
has been extracted. These approaches can interact
in interesting ways if the strength of a feature is
expressed as a function of the featureness of its
ancestors or descendents. Figures 10a-c show the
features resulting for the roadsign images at
different levels of resolution by projecting
extracted feature positions up the cone.
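
The projection of feature positions up the cone can be sketched in a few lines. This is an illustrative sketch that assumes each parent has a non-overlapping 2x2 descendent block, so that the parent of pixel (r, c) is (r // 2, c // 2); the function names are hypothetical.

```python
def project_features_up(features):
    """features: set of (row, col) positions at level i+1.
    A parent at level i is a feature if any of its descendents is one."""
    return {(r // 2, c // 2) for (r, c) in features}

def feature_cone(features, n):
    """Feature sets for levels 0..n, given features extracted at the
    highest resolution level n (a 2**n x 2**n image)."""
    levels = [set(features)]
    for _ in range(n):
        levels.insert(0, project_features_up(levels[0]))
    return levels

# Three features in an 8x8 image (level 3), projected to the apex.
levels = feature_cone({(0, 0), (5, 6), (7, 7)}, 3)
print(levels)
```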

The translational processing can be applied to
successive images at any level of resolution for
which features have been extracted from the initial
image. The basic questions concern how processing
at one level affects processing at another level.
In  particular,  how  do  processing  results  at  a  lower 
level  of  resolution  (higher  in  the  cone!)  constrain 
the  processing  at  higher  levels  of  resolution?  At 
what  level  in  the  cone  can  processing  be 
meaningfully  initialized?  How  do  the  various 
parameters  involving  feature  window  size, 
displacement  resolution  along  a  flow  path,  and 
resolution  of  the  optimization  procedure  change  at 
different  levels  of  the  cone? 


For a given pair of images at level i in the cones
formed  from  successive  images,  the  error  function 
is  minimized  for  the  set  of  features  determined  at 
level  i  using  projection  up  the  cone  from  the  first 
image.  The  determined  minimum  at  level  i  is  then 
used  to  constrain  the  optimization  of  the  error 
function  for  the  images  and  feature  positions  at 
the  next  lower  level  in  the  cone.  In  addition  to 
constraints  on  the  position  of  the  error  function 
minimum,  processing  higher  in  the  cone  constrains 
the  evaluation  of  the  potential  displacements  of 
extracted  features  along  their  displacement  paths. 
For each displacement determined at level i, only
three positions have to be evaluated along the
displacement paths at level i+1. Thus in
processing, not only is the minimum of the error
function passed on, but also the displacements of
parent features, which are then used to constrain
the evaluation of the displacements of descendent
features at the next lower level.

There is a wide range of possibilities for
implementing the error function minimization at the
different cone levels. The different resolutions
used in the step size of the error function
evaluation could be correlated with a particular
image level at which processing is being done.
That is, as processing proceeds
down the cone, the stepsize of the error function
evaluation would decrease. Alternatively, a
complete  search  could  be  done  at  each  level  before 
proceeding  further  down  the  cone.  Feature  size  can 
also  change  as  processing  goes  down  the  cone  since 
at  higher  levels  a  given  window  size  corresponds  to 
an  increased  area  with  respect  to  the  image.  At  a 
high  level  of  resolution,  features  described  by 
small  image  areas  may  not  be  distinctive  enough  to 
match  well. 

In the experiments in figures 11a-c, processing was
initialized at level 4 (16 x 16 images) by
performing the global processing as above. The
resulting flow field is shown in figure 11a. The
first step of the local processing used a stepsize
equal to 0.1 radians and was performed using the
images at level 5. The resulting flow field is
shown in figure 11b. At level 6, the stepsize was
reduced to 0.025 and the local search initialized
at the minimum determined by the processing done at
level 5. At level 7, the stepsize was reduced to
0.005 and the search was initialized at the minimum
determined at level 6. 5x5 windows were used at
each level. The procedure converged to the same
results as in the experiment above.
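
The coarse-to-fine schedule can be sketched as a sequence of local searches, each started at the minimum found at the previous level and using a smaller stepsize. The sketch below is illustrative only: it parameterizes the axis by two angles, uses a greedy neighborhood search, and substitutes a single toy quadratic for the per-level error functions, none of which is taken from the experiments. Only the stepsizes 0.1, 0.025, and 0.005 come from the text.

```python
def local_search(error, start, step):
    """Greedy descent over (theta, phi) on a grid of spacing `step`."""
    best, best_err = start, error(*start)
    improved = True
    while improved:
        improved = False
        t, p = best
        for dt in (-step, 0.0, step):
            for dp in (-step, 0.0, step):
                cand = (t + dt, p + dp)
                e = error(*cand)
                if e < best_err:
                    best, best_err, improved = cand, e, True
    return best

def coarse_to_fine(errors_by_level, start, steps):
    """Run one local search per cone level, passing the minimum down."""
    estimate = start
    for error, step in zip(errors_by_level, steps):
        estimate = local_search(error, estimate, step)
    return estimate

def toy_error(theta, phi):
    """Stand-in for the error measure, with its minimum at (0.3, -0.2)."""
    return (theta - 0.3) ** 2 + (phi + 0.2) ** 2

estimate = coarse_to_fine([toy_error] * 3, (0.0, 0.0), [0.1, 0.025, 0.005])
print(estimate)
```

In an actual hierarchical run, each entry of `errors_by_level` would evaluate the error measure on the images and features of one cone level rather than reuse the same function.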

An  important  question  concerns  the  cone  level  at 
which to begin processing. One criterion could be
the level at which significant changes in image
values occur as determined by an average difference
value. Another could be the response of the error
function. This would involve determining the
level  at  which  the  error  function  has  a  distinct 
minimum. 


Another  important  question  concerns  handling 
features  which  are  on  different  sides  of  an 
occlusion  boundary  but  share  a  similar  ancestor  in 
the  feature  tree.  In  this  case,  the  displacement 
value  inherited  from  the  parent  may  be  incorrect 
for  one  of  the  features  and  the  feature  should  have 
its potential displacements re-evaluated along its
displacement path. A possible criterion to
determine the need for re-evaluation of the
displacements for a feature is whether its match value
is ever less than some threshold or is less than
the match strength of its parent. It may be
sufficient simply to not evaluate such features if
they are found and determine their displacements or
occlusion after the more certain image
displacements have been determined for other image
points.

Figure 11a


6.2 Translational Blur Path Extraction


Blur  streaks  are  commonly  produced  when  the  shutter 
mechanism  of  a  camera  remains  open  while  the  camera 
is  moving  relative  to  a  textured  surface.  The 
streaks  are  produced  by  the  successive  positions  of 
the  image  projections  of  the  texture  elements. 
Recent work [HAR80, SHE83] indicates that blur
streaks  may  be  a  very  common  motion  effect  in  the 
human  visual  system. 

For  translational  camera  motion,  the  blur  streaks 
will  correspond  to  the  image  displacement  paths: 
straight  line  segments  radiating  from  a  common 
intersection  point.  The  techniques  developed  here 
can be easily extended for the extraction of
translational blur paths. First, it is necessary to
compute the gradient of the blurred image. The
image  gradient  will  be  perpendicular  to  the 
translational  blur  paths  at  image  positions  which 
lie  along  a  translational  blur  path.  Thus,  the 
error  measure  becomes 


$$\sum_{i=1}^{n} \left| \cos \theta_i \right| \qquad (6)$$


where i is an index over image positions and
$\theta_i$ is the angle between the image gradient at
point i and the translational displacement path
corresponding to a particular translational axis.
The  same  evaluation  techniques  can  be  used  for  this 
error  function  as  above,  except  that  there  is  no 
need to distinguish between FOEs and FOCs. Note
that in the analysis of translational blur paths,
information is lost concerning the direction and
magnitude of the displacement of image points over
time.
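
The blur error measure can be sketched directly from its definition. The sketch below makes a simplifying assumption: the candidate translational axis is represented by its intersection point with the image plane, so the displacement path at each point is the line through that point and the candidate intersection. The function name and the synthetic data are hypothetical.

```python
import math

def blur_error(points, gradients, foe):
    """Sum over image positions of |cos(theta_i)|, where theta_i is the
    angle between the image gradient at point i and the radial
    displacement path through the candidate intersection point foe."""
    total = 0.0
    for (x, y), (gx, gy) in zip(points, gradients):
        dx, dy = x - foe[0], y - foe[1]
        norm = math.hypot(dx, dy) * math.hypot(gx, gy)
        if norm == 0.0:
            continue  # angle undefined at the FOE or for a zero gradient
        total += abs(dx * gx + dy * gy) / norm
    return total

# Synthetic check: streaks radiate from (0, 0), so each gradient is
# perpendicular to the radial direction at its sample point.
pts = [(1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (-2.0, 1.0)]
grads = [(-y, x) for (x, y) in pts]
print(blur_error(pts, grads, (0.0, 0.0)), blur_error(pts, grads, (5.0, 5.0)))
```

Because only the absolute value of the cosine is used, reversing every path direction leaves the measure unchanged, which is why FOEs and FOCs need not be distinguished.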


Figure  11b 


Figure  11c 


It may be useful to use multiple versions of the
same image sequence, each formed with a different
exposure rate. By subtracting the images formed
with very short exposure rates (which are basically
static images with no dynamic information contained
within but edge information corresponding to actual
environmental structures) from those with longer
exposure rates, it may be possible to suppress
edges which are non-blur related in the blurred
images. Regardless, the more blurred the images
become, the more the static image structure is
reduced.

The  extraction  of  translational  blur  paths  is 
identical  to  the  extraction  of  vanishing  points  and 
lines  from  static  images.  This  same  procedure  can 
be  applied,  without  the  initial  extraction  of 
edges:  indeed,  the  determination  of  edges  can 
occur  concurrently  with  the  extraction  of  the 
vanishing  point.  We  have  successfully  applied  this 
procedure  to  the  extraction  of  translational  blur 
paths and the extraction of vanishing points in
natural,  outdoor  images. 


6.3 Multiple Independently Moving Objects


The procedure developed here allows for a sensor
moving relative to a stationary environment or a
single object moving relative to a stationary
sensor. A useful extension would allow for
multiple, independently moving objects while
maintaining the ability to determine image
displacements concurrently with the direction of
translation. There are at least three techniques
which would make this possible. One is to utilize
generalized Hough transform techniques for
decomposing the responses in a histogram into the
corresponding image structures or segments. The
others utilize the limited image areas over which
the procedure can successfully function.

As  pointed  out  in  the  discussion  above,  the  global 
sampling  of  the  error  function  is  an  instance  of  a 
generalized  Hough  transform.  Each  feature  is 
scaling  its  vote  against  a  particular  translational 
axis  by  the  extent  of  feature  mismatch  it  has  along 
an  image  displacement  path  determined  by  the 
translational  axis.  Without  changing  anything,  and 
to  be  consistent  with  other  developments,  instead 
of  using  an  error  measure,  we  could  use  an 
optimization  measure  by  which  each  feature  scales 
its  vote  for  a  particular  translational  axis  by  the 
extent  of  match  it  has  consistent  with  the  axis. 
The  problem  then  becomes  a  typical  one  for 
generalized  Hough  transforms:  how  to  associate 
labels  corresponding  to  the  resulting  peaks  in  the 
histogram  with  image  points  or  features.  The 
general  form  of  this  processing  is  to  find  the 
greatest response in the Hough transform, associate
a label with it, and then associate this label
with, in this case, image features which match
above some threshold (corresponding to strength of
match) along the path determined by the axis. The
resulting set of features is then removed and a
new histogram is produced (rehistogramming).
The peak in this new histogram is determined, a new
label associated with it and mapped onto the
corresponding image features. This process is
repeated until there are no more distinct peaks in
the resulting Hough transforms or all image
features are labeled.
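
The peak-extraction loop can be sketched as follows. This is a toy illustration under stated assumptions: `match[a][k]` is taken to be the best match strength of feature k along the path determined by candidate axis a, and a peak is treated as distinct when its accumulated vote reaches `min_votes`. Both thresholds and all names are hypothetical.

```python
def label_by_peaks(match, threshold, min_votes):
    """Repeatedly find the strongest axis, label the unlabeled features
    that match it above threshold, remove them, and rehistogram."""
    n_axes, n_feat = len(match), len(match[0])
    unlabeled = set(range(n_feat))
    labels = {}
    while unlabeled:
        # Rehistogram using only the features that are still unlabeled.
        votes = [sum(match[a][k] for k in unlabeled) for a in range(n_axes)]
        peak = max(range(n_axes), key=votes.__getitem__)
        members = {k for k in unlabeled if match[peak][k] >= threshold}
        if votes[peak] < min_votes or not members:
            break  # no distinct peak remains
        for k in members:
            labels[k] = peak
        unlabeled -= members
    return labels

# Toy data: 3 candidate axes, 5 features belonging to two motions.
match = [
    [0.9, 0.8, 0.9, 0.1, 0.2],
    [0.1, 0.2, 0.1, 0.9, 0.8],
    [0.3, 0.3, 0.3, 0.3, 0.3],
]
labels = label_by_peaks(match, threshold=0.7, min_votes=1.0)
print(labels)
```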

Unfortunately,  this  procedure  will  have 
difficulties  with  weak  or  homogeneous  feature 
points  which  have  strong  matches  consistent  with 
several distinct translational axes. Thus, when
rehistogramming occurs it is necessary to
establish which image features, which are already
labeled, are consistent with the newly extracted
peak. This is costly and could be quite messy. An
alternative is to proceed in the conventional
manner and determine a set of labels corresponding
to translational axes for which there is evidence.
Each feature is then labeled with each
translational axis from this set with which it is
consistent. Note that a given feature could have
several labels. A unique consistent labelling is
then obtained by using other information:
segmentation-grouping using other image attributes,
depth consistency with neighbors, and common
magnitude of image displacements. Additionally,
this disambiguation can occur over several
successive images.


Two  basic  questions  to  be  addressed  in  this  use  of 
Hough  techniques  are  what  is  the  required  density 
of  translational  axes  in  the  transform  and  what  is 
the  minimal  match  threshold. 

An  alternative  approach  is  to  break  the  image  into 
subparts  and  then  locally  apply  the  procedure  to 
associate  a  translational  axis  with  each  subpart. 
In  one  scheme,  this  would  be  done  using  regular 
image  areas  (as  in  a  grid).  In  another  scheme,  the 
subparts  are  determined  by  some  segmentation 
procedure  and  the  translational  axis  is  determined 
from  image  features  within  or  lying  along  the 
boundary  of  the  extracted  segments.  Segments  for 
which  the  error  function  response  is  indistinct  are 
resegmented  or  their  features  are  associated  with 
the  translational  axes  determined  for  adjacent 
image subparts.


6.4 Local Translational Decomposition


The  technique  for  translational  motion  processing 
can  be  extended  to  less  restricted  forms  of  sensor 
motion  by  applying  the  procedure  to  small  areas 
across  an  image  surface  over  a  sequence  of  images. 
This  approximates  more  general  motions  as 
consisting  locally  of  environmental  translations 
and  interprets  local  image  motion  as  resulting  from 
environmental  translations.  The  feasibility  of 
this  is  based  upon  experiments  showing  that  the 
direction  of  translation  can  be  extracted  with 
reasonable  precision  using  small  image  areas 
containing  a  few  features.  The  resulting 
description  associates  with  a  set  of  image  points 
(or small image areas) the approximated direction
of  motion  of  the  corresponding  environmental  points 
(or  small  environmental  surface  area).  As  a  low 
level  representation  of  environmental  motion,  this 
can  considerably  simplify  the  recovery  of  the 
sensor motion parameters [LAW82]. It can also
provide  qualitative  information  concerning  the 
rough  motion  characteristics  of  objects  in  a  scene. 

6.5  Other  Cases  of  Restricted  Motion 


The  procedure  developed  here  is  applicable  to  other 
cases  of  unknown  but  restricted  camera  motions  for 
which  it  is  computationally  feasible  to  search 
directly  through  a  subspace  of  the  camera  motion 
parameters.  Two  particular  cases  are  pure  sensor 
rotation  and  motion  constrained  to  a  known  plane. 

With pure sensor rotation, the unknown camera
parameters are constrained to R1(t), R2(t), and
R3(t). In this case, the error measure is defined
with respect to a direction of rotation sphere
where each point corresponds to an axis of
rotation. For each rotational axis, the extent of
displacement for image features is determined by
different values of R3(t). There is the additional
constraint in this case that the displacements of
all features must correspond to the same value of
R3(t). Thus the variance of the determined angular
displacements can be incorporated into the error
measure.

For motion constrained to a known plane, the
rotational axis is known to be perpendicular to the
plane and the translational axis is constrained to
lie in it. Thus, only R3(t) and one translational
parameter can vary and the error measure can be
computed with respect to these two parameters. The
global  sampling  in  this  case  amounts  to  evaluating 
a  set  of  translational  axes  for  each  of  a  set  of 
potential  rotations. 


6.6 Hybrid Sensor Systems


Translational  processing  is  sufficient  for 
vision-based  navigation  in  a  stationary  environment 
if  the  orientation  of  the  optic  sensor  can  be  fixed 
relative  to  the  environment  over  time.  In  this 
case,  sensor  motion  amounts  to  a  sequence  of 
translations in possibly different directions over
time.

A difficulty with such a stabilized retina is that
much of the environment would not be observable.
This can be corrected by using a set of such
stabilized retinas arranged to yield a complete
view of space. There would then be no need to
rotate the sensor to view a particular
environmental point. A possible arrangement of
retinal surfaces is a cubical one. One of the
retinal planes will always contain an FOE and
another will always contain an FOC (unless the
direction of motion is right on an edge of the cube
and the focal length has not been properly
adjusted). There will also be several independent
estimates of the direction of translation which can
be integrated.

Alternatively, if the sensor cannot be stabilized,
there are devices which can at least determine the
rotational parameters of sensor motion. The
rotational effects can then be removed from
successive images, reducing them to translational
sequences which can be processed by the techniques
here. A particular technology which is very
attractive for this use is Fiber Optic Rotation
Sensors [EZE82], which are expected to be the
low-cost gyroscope of the near future. These
devices are small, cheap, and precise. There are
currently slow drift problems, but we would be
concerned with measurements of rotation over very
short periods. Additionally, when coupled with an
image processing system, such long term drifts
could be recognized and accounted for by noting the
position of specified landmarks.


7.0  CONCLUSIONS 


This  work  demonstrates  a  simple  and  robust 
procedure  for  determining  the  direction  of 
environmental  motion  and  image  displacements  in 
real-world  image  sequences  produced  by  translation. 
It  is  not  dependent  on  an  initial  matching  process 
prior  to  the  inference  of  camera  motion.  Instead, 
features  are  extracted  from  an  initial  image  and 
their  displacements  are  determined  concurrently 
with  the  inference  of  direction  of  sensor  motion. 
Thus complications in matching that arise from an
individual  feature  being  extracted  in  one  image  and 
not  in  the  next  are  reduced.  The  process  is  also 
relatively  insensitive  to  weak  and  false  features. 
It  can  use  a  small  number  of  features  positioned 
across  an  image  surface  or  a  small  number  of 
features  from  a  limited  area  of  the  image.  It  has 
been  successfully  applied  to  image  sequences 
produced  by  a  car  translating  down  a  road,  by  a 
camera  attached  to  a  robot  manipulator  in  an 
industrial  environment,  and  to  artificially 
generated  sequences. 

We  further  considered  and  demonstrated  several 
extensions  and  applications  for  such  things  as 
independently  moving  objects,  translational  blur 
streaks,  other  cases  of  restricted  motion, 
computation  in  a  hierarchical  structure,  and 
potential  incorporation  into  hybrid  sensor  systems 
for  autonomous  navigation. 


ABSTRACT

The problem of interpreting the shape of a three-dimensional
space curve from its two-dimensional perspective image contour
is considered. Observation of human perception indicates that
a good strategy is to segment the image contour in such a way
as to obtain approximately planar segments. The orientation of
the osculating plane (the plane in which the space curve lies) can
then be estimated for these segments, and the three-dimensional
shape recovered. The assumption of spatial isotropy is used
to derive the theoretical results needed to formulate such an
estimation strategy. The resulting estimation strategy allows a
single three-dimensional structure (up to a single Necker reversal)
to be assigned to any smooth image contour. An implementation
is described and shown to produce an interpretation that is quite
similar to the analytically correct one in the case of a helix, even
though a helix has substantial torsion. The general applicability
of the algorithm is discussed.

I  Introduction 

Much recent vision research has emphasized the importance
of image contour for shape interpretation [1,2,3,4,5,6,7].
Tenenbaum and Barrow [1] argue that image contour, for example,
is dominant over shape from shading. Pentland [8] has
presented examples in which the addition of a contour substantially
improved the interpretation of a shaded surface. It seems
that contour is one of the strongest sources of information for
shape perception.

One  source  of  evidence  of  the  strength  of  contour  information 
is line drawings. When we examine a line drawing, our perception
of the three-dimensional shape implied by such a drawing is
nearly  always  clear  and  unambiguous.  How  can  we  account  for 
this,  given  that  purely  geometrical  constraints  admit  of  an  infinite 
number  of  valid  interpretations? 

A.  An  Observation  About  Human  Perception 

When we observe line drawings such as those in Figure 1
(a), we have a clear perception of a non-planar three-dimensional
structure. Notice that if we were to segment each of these drawings
at the circled points, each of the resulting segments would
have the same shape as it did when it was still hooked
together and would be approximately planar, as is shown in
Figure 1(b). Thus, for these line drawings the problem of recovering
the three-dimensional structure can be reduced to the problems
of (1) segmenting the curve into perceptually planar segments,
and (2) finding the plane that contains each of the curve
segments (the osculating plane) [9]. Once we know the orientation
of the plane which contains a curve segment we can then
easily determine its three-dimensional shape.

* The research reported herein was supported by the Defense
Advanced Research Projects Agency under Contract No. MDA
903-83-C-0027; this contract is monitored by the U.S. Army
Engineer Topographic Laboratory. Approved for public release,
distribution unlimited.


Figure 1 (a) Some Line Drawings, (b) Their Planar Subregions.


If we "by hand" try to segment image contours into planar
regions, we find that the strategy can be successfully applied
to a surprisingly large number of naturally-occurring image contours.
For some contours, however, it is not obvious how well
this strategy will work, primarily because there are no points
which segment the space curve into planar regions. An example
of such a curve is the helix shown in Figure 2 (a). Nonetheless,
it may still be possible to obtain a good approximation of the
three-dimensional structure of such a curve using this strategy.

B.  A  Strategy  For  Recovering  Three-Dimensional  Shape 

This observation about human perception leads to the following
processing strategy:

(1) Segment the image contour in such a way that each
segment is likely to comprise a projection of a planar segment
of the space curve.

(2) Calculate the planes implied by the segments from (1).

(3) Assemble the results of (2) into an estimate of the shape
of the entire space curve.

The specific criteria for the initial segmentation are not dealt
with here. It is clear, however, that the image contour should
be segmented at singular points of curvature (maxima, minima,
and inflection points). Hoffman and Richards [10] have presented
a theory of curve segmentation that addresses this issue. Our
approach will be to temporarily ignore the segmentation problem
and to simply estimate the orientation of parts of the space curve
from many local parts of the image contour. If valid results are
forthcoming with this approach, the method can only be improved
with more elaborate segmentation.

C.  Modeling  the  Space  Curve 

We shall model a space curve in the conventional way, as a
three-dimensional vector function x(s) of one parameter s which
is assumed to be a natural parameter, i.e., |dx(s)/ds| = 1. The
shape of such a curve is completely determined by two properties
that are scalar functions of s: curvature, κ(s), and torsion, τ(s) [9].
Curvature is always nonnegative; only straight lines and inflection
points have zero curvature. Torsion may be intuitively defined as
the amount of "twist" in the curve at a point s. Another way to
visualize torsion is as the degree to which the osculating plane
(the plane which contains the curve) is changing. Only planar
curves have zero torsion everywhere. Unlike curvature, torsion
may be either negative or positive.
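
As a concrete illustration, consider the circular helix, a standard textbook case rather than part of the derivation here. The radius a > 0 and the pitch parameter b are symbols introduced only for this example. With a natural parameter s, the curve, its curvature, and its torsion are:

```latex
\[
\mathbf{x}(s) = \left( a\cos\frac{s}{c},\; a\sin\frac{s}{c},\; \frac{b\,s}{c} \right),
\qquad c = \sqrt{a^{2}+b^{2}},
\]
\[
\left|\frac{d\mathbf{x}(s)}{ds}\right| = 1, \qquad
\kappa(s) = \frac{a}{a^{2}+b^{2}}, \qquad
\tau(s) = \frac{b}{a^{2}+b^{2}} .
\]
```

Both quantities are constant along the helix. Setting b = 0 gives a planar circle with zero torsion, and the sign of b determines the sign of the torsion.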

The presence of torsion is not directly evident in the image.
It simply results in more or less foreshortening as the osculating
plane of the contour varies. The effects of torsion, therefore, can
be exactly mimicked by changes in curvature, and vice versa.

II  Theory  of  Contour  Interpretation 

Not all three-dimensional interpretations of an image contour
are equally likely. If we assume that spatial isotropy holds,
then we know that viewer position is independent of the shape of
the curve, which allows us to make a reasonable guess about
the latter's three-dimensional shape [8]. The first step towards a
guess at the space curve's shape is the following proposition.

Proposition  (Zero  Torsion).  The  maximum-likelihood 
estimate of the torsion of the space curve is zero (i.e., no
“twisting”  of  the  curve). 

This proposition follows because the assumption of spatial
isotropy implies that the viewer's position and the shape of the
space curve are mutually independent. Thus, not only is it unlikely
that significant features of the curve will be hidden from
view by coincidental alignment of the viewer and the curve, but,
conversely, it is likely that the viewed scene will not change much
with small changes in viewing position.* The appearance of a
curve with substantial torsion will change considerably with
small changes in viewer position; if we assume spatial isotropy,
therefore, we must expect that the torsion of the curve will be
small.

Furthermore, given that spatial isotropy implies that the
viewer position and the shape of the curve are mutually independent,
the torsion of the curve must then also be independent
of viewer position. Consequently, the torsion of the curve is as
likely to be positive as negative, and thus the mean value (and
maximum-likelihood estimate) for the magnitude of the torsion
is zero.* The probability that the torsion is small implies this
estimate will generally be a good one.

A.  Estimation  With  The  Assumption  Of  Zero  Torsion 

Even if we assume that torsion is zero (i.e., the space curve
is planar), there is still a two-parameter set of space curves that
could have generated that imaged contour. The two parameters
correspond to the two degrees of freedom of the osculating plane.

Assume that we are given a small portion of an imaged
contour, and asked to estimate the three-dimensional shape of
the space curve which generated that image. If we measure the
position and curvature at three points on the imaged contour,
then we can uniquely define an elliptical arc that fits the image
data. By the previous proposition, this elliptical arc is most likely
caused by a space curve that is either an arc of a circle or of an
ellipse, as those are the two planar (zero torsion) shapes which
can project to an ellipse.**

*This is often referred to as the assumption of general position.
Thus, spatial isotropy implies general viewing position.

As a function of position on the image contour rather than as
a function of $s$.

*Note that at places where the curvature is zero (straight
segments and inflection points) the torsion is not defined and
may arbitrarily be taken to be zero. That is, the osculating plane
may be changed freely at these points without affecting the shape
of the space curve.

**This is true of both perspective and orthographic projection;
however, we will deal exclusively with the more general case of
perspective foreshortening.

Previous research ([2], [12]) has shown that the maximum-likelihood
estimate of the space curve's shape is given by the
following proposition (see also [2]):

Proposition (Planar Interpretation). Given an elliptical
segment of an image contour and that the space
curve is planar, the maximum-likelihood estimate of the
space curve's three-dimensional shape is a segment of a
circle.

Barnard [12] has constructed a maximum entropy estimator
that implements this proposition for perspective images and that
is tolerant of digitization noise. Operating under the assumption
that the space curve has zero torsion, it chooses the orientation
that maximizes the entropy of backprojected image contour
curvature measurements. That is, curvature is first measured
at several points in the image contour, then the curvatures of
hypothetical planar space curves of essentially all orientations are
computed by backprojection, and, finally, the orientation that
leads to the space curve of most uniform curvature (in the sense
of maximum entropy) is selected. In general, three image contour
curvature measurements are sufficient for an unambiguous
maximum-entropy interpretation (up to a Necker reversal).
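The Planar Interpretation proposition can be illustrated in a much simpler setting than the one treated here. The sketch below assumes orthographic projection (the text works with perspective) and is not Barnard's estimator; the function name and arguments are hypothetical. Under that assumption a circle of radius r viewed at slant angle sigma projects to an ellipse with semi-axes r and r cos(sigma), so the circular interpretation of an image ellipse can be read off its axes.

```python
import math


def circle_from_ellipse(semi_major, semi_minor):
    """Circular interpretation of an image ellipse (orthographic assumption).

    Returns (radius, slant) where slant is the angle, in radians, between
    the plane of the circle and the image plane. The sign of the slant is
    not recoverable (the Necker reversal), so the non-negative value is
    returned.
    """
    if not (0.0 < semi_minor <= semi_major):
        raise ValueError("expected 0 < semi_minor <= semi_major")
    radius = semi_major                      # the unforeshortened axis
    slant = math.acos(semi_minor / semi_major)  # foreshortening = cos(slant)
    return radius, slant


radius, slant = circle_from_ellipse(2.0, 1.0)
print(radius, math.degrees(slant))
```

The tilt direction (the image orientation of the minor axis) supplies the second degree of freedom of the osculating plane; it is omitted here to keep the sketch short.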

III Three-Dimensional Estimation

Now let us return to the general problem of estimating the
shape of the space curve, given a smooth imaged contour. Let
us first take three curvature measurements along the imaged
contour. These three measurements define an ellipse. As just
described, this leads to a circular interpretation of the space
curve. Now suppose that we have additional image contour curvature
measurements. There are, then, two cases to consider:

First case: the new points fit on the same ellipse. In
the first case we have quite strong evidence of the space curve's
shape. For, if the osculating plane were changing, the curvature
would have to be changing also, and in just such a manner
as to exactly cancel (in the image) the effect of the changing
osculating plane. Similarly, if the curvature of the space curve
were changing, the osculating plane would have to change just
exactly enough to cancel the effect of the changing curvature.
As such a "conspiracy" to cancel the visible effects of change is
unlikely (a direct violation of general position), we must conclude
that there was neither torsion nor change in curvature, and, thus,
there is a great (in fact, maximum) likelihood that the new image
curvature measurements result from the same circular space curve
defined by the first three measurements.

Second case: the new points don't fit on the same
ellipse. What if the additional measurements lie off the ellipse
defined by the first three measurements? Then we can be certain
that either the curvature or the osculating plane (or both) of
the space curve has changed. This new point is, therefore, a
possible place to segment the curve. What we must do when we
encounter such a point is advance along the image contour until
we are completely past the point, and obtain a new estimate of
the space curve's osculating plane. If the new osculating plane
has the same orientation as the previous osculating plane, then
we have evidence that the space curve continues to be planar,
and we should not segment the curve. If, however, we obtain
a different orientation for the osculating plane, then we should
segment the space curve and begin a new planar segment of the
curve.

As any smooth image contour may be closely approximated
by portions of ellipses and straight lines,* this interpretation
strategy will yield a single interpretation for the three-dimensional
shape of the space curve (up to Necker reversals).
Further, this interpretation will be the most likely interpretation
on a point-by-point basis. It should be noted that the first
two steps of this estimation strategy are similar to the strategy
proposed in [1].

*It is only the third and higher derivatives of the imaged contour
that will fail to be exactly matched. People, it should be noted,
are very poor observers of changes in the third derivatives of an
image contour.

IV  An  Example 

The interpretation strategy has been implemented and applied
to a synthetic image of a helical space curve. The helix
example is a good test because a helix has significant torsion
everywhere; thus distinguished segmentation points do not exist
and it is not clear what the estimation strategy will do. If
we can recover the helical shape of the space curve with some accuracy
we shall have demonstrated that the estimation strategy
can perform even when no good segmentation is available.
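The remark that a helix has significant torsion everywhere can be checked against the standard formulas for a circular helix. The parametrization below, with radius $a$ and pitch parameter $b$, is an assumption made for illustration; the text does not say which helix was rendered.

```latex
\mathbf{x}(t) = (a\cos t,\; a\sin t,\; b\,t), \qquad
\kappa = \frac{a}{a^{2}+b^{2}}, \qquad
\tau = \frac{b}{a^{2}+b^{2}}
```

Both quantities are constant along the curve, and $\tau \neq 0$ whenever $b \neq 0$, so no point of the helix stands out as a place where the osculating plane starts or stops changing.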

Figure 2 (a) shows a perspective image of a helix. Figure
2 (b) shows a plot of the spherical indicatrix of the helix. The
spherical indicatrix is a plot of the orientation of the osculating
plane of the space curve. The axes in this plot correspond
to the azimuth and elevation of the osculating plane. As mentioned
previously, knowledge of the orientation (azimuth and
elevation) of the osculating plane at each point, together with
the imaged contour, uniquely determines the shape of the space
curve. Thus, the spherical indicatrix is a method of displaying the
three-dimensional shape of the space curve. Figure 2 (c) shows
the spherical indicatrix estimated for the contour in (a). When
this is compared with the actual indicatrix shown in (b), it is
evident that the three-dimensional shape of the space curve has
been fairly accurately recovered.

Summary. We have developed a theory for assigning a
three-dimensional interpretation to any smooth image contour.
The theory has been implemented and is undergoing evaluation,
which may lead to further development. The results reported
above indicate that the estimation strategy performs reasonably
well even for cases such as a helix, where the presence of substantial
torsion might have led one to expect poor performance.

Abstract


A  method  for  extracting  the  motion  parameters  of 
several  independently  moving  objects  from  displacement  field 
information  is  described.  The  method  is  based  on  a 
generalized  Hough  transform  technique.  Some  of  the 
problems  of  this  technique  are  addressed  and  appropriate 
solutions  are  proposed.  A  modified  multipass  Hough 
transform  approach  has  been  implemented,  where  in  each 
pass  windows  are  located  around  objects  and  the  transform 
is applied only to the displacement vectors contained in
these  windows.  The  windows  are  determined  by  the  degree 
to  which  the  displacement  field  is  locally  inconsistent  with 
previously  found  motion  transformations.  Thus,  the 
sensitivity  of  the  Hough  transform  to  local  events  is 
increased  and  the  motion  parameters  of  small  objects  can 
be  detected  even  in  a  noisy  displacement  field.  We  also 
use  a  multi-resolution  scheme  in  both  the  image  plane  and 
the  parameter  space  and  thus  reduce  the  computational  cost 
of  the  technique.  The  method  is  demonstrated  by 
experiments  based  on  artificial  images  with  four  parameters 
of  2-D  motion:  rotation,  expansion  and  translation  in  both 
axes. 


1.  Introduction 

A  time-varying  scene  may  contain  several 
independently  moving  objects  with  unknown  location,  shape 
and  3-D  structure.  The  interpretation  of  such  a  scene 
includes  the  computation  of  the  motion  parameters  of  the 
camera and each moving object. This information is useful
in areas such as robotics and navigation. It could also be
used as an intermediate stage for achieving the tasks of
object-surround separation and structure determination.

Our  approach  for  recovering  the  motion  parameters  is 
based  on  two  phases.  First,  we  compute  a  displacement 
field,  composed  of  vectors  describing  the  displacement  of 
image  elements  from  one  image  to  the  next  (see  section  2). 
In  this  paper  we  assume  a  dense  displacement  field,  but  the 
second  phase  is  basically  independent  of  this  assumption. 
Each displacement vector is assigned a weight representing
its reliability.

In the second phase the displacement field is
interpreted and the motion parameters are recovered. This
phase, which is the main concern of the paper, is based on
the generalized Hough transform technique [BAL81a]. In
this technique the motion parameters are represented by a
discrete multi-dimensional parameter space where each
dimension corresponds to one of the parameters. Each point
in this space uniquely characterizes a motion transformation,
defined by the corresponding parameter values. A
displacement vector "votes" for a point in the space if the
corresponding transformation is consistent with this vector.
The points receiving the most votes are likely to represent
the motion parameters of different objects.

There are a few techniques described in the literature
which use the Hough transform for dealing with scenes
containing several moving objects. Fennema and Thompson
[FEN79] compute spatial and temporal gradients of the
image. A Hough transform technique is used to detect
velocities which are consistent with a significant portion of
the gradient field. A multipass approach is used: first the
most prominent peak in the Hough transform is found and
thus the velocity of the largest object is recovered. Then
the image points which are consistent with this velocity are
removed and a new peak is looked for. The process is
repeated until no further objects are found. This system is
restricted to translation. It also has problems in recognizing
significant peaks [THO81].

Ballard and Kimball [BAL81b] consider the case of
general 3-D motion of rigid objects, but assume knowledge
of depth information. A Hough transform technique for
computing the motion parameters from 3-D optic flow is
implemented. The simulation, as described in their report,
assumes only one moving object, but it is argued that a
multipass approach would handle the case of several moving
objects.

Jayaramurthy and Jain [JAY82] describe an
implementation of the Hough transform technique for
computing motion parameters directly from the intensity
information. Several moving objects are allowed, but a
stationary background and translational motion are assumed.

One  of  the  well  known  advantages  of  the  Hough 
technique  is  its  relative  insensitivity  to  noise  and  partially 
incorrect  or  occluded  data.  Another  advantage  is  its  ability 
to detect consistency in the image. In our case it can
group together displacement vectors which satisfy the same
motion parameters and presumably belong to one object.

On the other hand, the Hough technique has a few
disadvantages. It is insensitive to spatial relations in the
displacement field. Thus, a group of non-adjacent elements,
which incidentally vote for the same motion transformation,
may be considered as representing one object, whereas the
motion parameters of a small object may be difficult to
detect. The technique also has high computational cost.
Fine resolution in the parameter space, which is related to
the accuracy of the final results, requires large amounts of
memory and computation time.

This  paper  addresses  these  problems.  A  few  ideas  are 
examined  in  a  restricted  case  of  2-D  motion  with  four 
parameters  (rotation,  expansion  and  translation  in  both 
axes). An analysis of reliability and efficiency considerations
is presented (section 3.2) and new solutions are proposed
(section 4). A modified multipass Hough transform
approach  has  been  implemented,  where  in  each  pass 
windows  are  located  around  objects  and  the  transform  is 
applied  only  to  the  displacement  vectors  contained  in  these 
windows.  The  windows  are  determined  by  the  degree  to 
which  the  displacement  field  is  locally  inconsistent  with 
previously  found  motion  transformations.  Thus,  the 
sensitivity  of  the  Hough  transform  to  local  events  is 
increased  and  the  motion  parameters  of  small  objects  can 
be  detected  even  in  a  noisy  displacement  field.  We  also 
use a multi-resolution scheme in both the image plane and
the  parameter  space  and  thus  reduce  the  computational 
cost  of  the  technique.  These  ideas  are  demonstrated  by 
experiments  based  on  artificial  images  (section  5). 

2.  Computing  a  Displacement  Field  and  a  Weight  Plane 

In  the  first  phase  of  the  algorithm  we  compute  a 
displacement  field  from  two  sampled  images.  These  images 
contain  several  objects  which  are  moving  independently.  The 
background  is  considered  as  one  of  the  objects.  The 
motion  of  each  object  is  composed  of  rotation,  expansion 
and  translation.  It  can  be  represented  by  the  following 
affine  transformation: 

(2.1)  $i' = (1+\mathit{expan})[\cos(\mathit{rot})\,i - \sin(\mathit{rot})\,j] + tr_1$

(2.2)  $j' = (1+\mathit{expan})[\sin(\mathit{rot})\,i + \cos(\mathit{rot})\,j] + tr_2$

where $(i,j)$ is a pixel in the first image, $(i',j')$ is the
corresponding pixel in the second image and $\mathit{rot}$, $\mathit{expan}$, $tr_1$
and $tr_2$ are the motion parameter values.
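A minimal sketch of equations (2.1) and (2.2) follows. The function name and argument order are hypothetical and chosen only for illustration; angles are taken to be in radians.

```python
import math


def apply_motion(i, j, rot, expan, tr1, tr2):
    """Map pixel (i, j) of the first image to (i', j') in the second image."""
    scale = 1.0 + expan
    i_new = scale * (math.cos(rot) * i - math.sin(rot) * j) + tr1  # (2.1)
    j_new = scale * (math.sin(rot) * i + math.cos(rot) * j) + tr2  # (2.2)
    return i_new, j_new


# Pure translation, pure expansion, and a quarter-turn rotation.
print(apply_motion(3, 4, 0.0, 0.0, 1.0, -2.0))
print(apply_motion(2, 0, 0.0, 0.5, 0.0, 0.0))
print(apply_motion(1, 0, math.pi / 2, 0.0, 0.0, 0.0))
```

Note that rotation and expansion act about the coordinate origin, so the same parameter values displace pixels far from the origin more than pixels near it.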

The displacement field can be described by
$\{(D_1(i,j), D_2(i,j))\}$ where $(D_1(i,j), D_2(i,j))$ represents the
displacement vector at the $(i,j)$ pixel in the first image.
We compute it by using the Horn and Schunck technique
[HOR80] (however, the second phase of our algorithm is
almost independent of this specific choice). In order to use
this technique we assume a small displacement at each pixel
and absence of illumination effects. It starts by
calculating, at each pixel, the spatial gradient $(E_1, E_2)$ and
the temporal derivative $E_t$. The assumption that the
brightness of a particular point in the scene is constant over
time provides the following constraint:

(2.3)  $E_1 D_1 + E_2 D_2 + E_t = 0$

The assumption of the smoothness of the displacement field
provides another constraint.


An error function can represent, for a given
displacement field, the degree of departure from these
constraints. The technique is based on iteratively
minimizing this function. Ideally, the resulting field
$(D_1, D_2)$ should satisfy the following equations, derived
from equations (2.1) and (2.2):

(2.4)  $i + D_1(i,j) = (1+\mathit{expan})[\cos(\mathit{rot})\,i - \sin(\mathit{rot})\,j] + tr_1$

(2.5)  $j + D_2(i,j) = (1+\mathit{expan})[\sin(\mathit{rot})\,i + \cos(\mathit{rot})\,j] + tr_2$

where $\mathit{rot}$, $\mathit{expan}$, $tr_1$ and $tr_2$ are the motion parameter
values at the $(i,j)$ pixel.

Figure  1  shows  two  pairs  of  artificial  images  which 
contain  several  independently  moving  objects.  The  motion 
parameters  of  each  object  are  specified  in  tables  5.1  and 
5.2.  Figure  2  shows  the  result  of  applying  the  Horn  and 
Schunck  technique  to  these  images. 

The  smoothness  constraint  is  violated  at  the 
boundaries  of  independently  moving  objects.  Therefore,  the 
computed  displacement  values  in  these  areas  are  incorrect. 
Fortunately,  these  areas  can  be  detected  by  using  the  error 
function  which  represents  the  departure  from  the  constraints. 
High  values  of  the  error  function  indicate  that  the 
constraints  are  not  satisfied  and  the  computed  displacement 
values  are  unreliable. 

For each displacement vector we compute an
associated weight such that high reliability (low value of the
error function) is represented by a value close to 1 and low
reliability by a value close to 0. An appropriate relation
between the error function, $\mathit{err}(i,j)$, and the weight, $W(i,j)$,
can be obtained by the function

(2.6)  $W(i,j) = e^{-\mathit{err}(i,j)/k}$

The parameter $k$ was experimentally determined as 0.07.
However, this value needs to be decreased with noisier
data. Figure 3 shows the weight planes computed for the
displacement fields in figure 2. When the Hough
transform is computed later the influence ("voting" power)
of each displacement vector will be proportional to its
associated weight.
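Equation (2.6) is simple enough to state directly in code. The sketch below is illustrative only: the function name is hypothetical, and the error values passed in are made-up numbers, not measurements from the experiments.

```python
import math


def weight(err, k=0.07):
    """Reliability weight of a displacement vector, equation (2.6).

    err is the (non-negative) value of the error function at the pixel;
    k = 0.07 is the experimentally determined constant quoted in the text.
    """
    return math.exp(-err / k)


for err in (0.0, 0.07, 0.5):
    print(err, weight(err))
```

A zero error gives a weight of exactly 1, an error equal to k gives 1/e, and larger errors, such as those expected near object boundaries, drive the weight toward 0.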

3.  The  Generalized  Hough  Transform  Technique 
3.1  General  Description 

The motion parameters can be represented by a
4-dimensional parameter space where each dimension
corresponds to one of the motion parameters: rotation ($\mathit{rot}$),
expansion or contraction ($\mathit{expan}$), vertical translation ($tr_1$)
and horizontal translation ($tr_2$). Each point in this space
uniquely characterizes a motion transformation in the image.
We say that a displacement vector $(D_1(i,j), D_2(i,j))$ is
consistent with a point $(\mathit{rot}, \mathit{expan}, tr_1, tr_2)$ in the parameter
space if it satisfies equations (2.4) and (2.5). Let us define
a subset $B(i,j)$ of the parameter space as the set of all the
points in this space which are consistent with
$(D_1(i,j), D_2(i,j))$. Using the definition in [BAL81a], the
Hough transform is the following function, defined on the
parameter space:

(3.1)  $H(\mathit{rot}, \mathit{expan}, tr_1, tr_2) = \sum_{(i,j)\,:\,(\mathit{rot}, \mathit{expan}, tr_1, tr_2) \in B(i,j)} W(i,j)$

i.e., $H(\mathit{rot}, \mathit{expan}, tr_1, tr_2)$ is the sum of the associated weights
of all the displacement vectors which are consistent with the
point $(\mathit{rot}, \mathit{expan}, tr_1, tr_2)$. High values of the Hough
transform represent motion transformations which are
consistent with a significant portion of the vectors. The
use of the weight function $W$ should prevent unreliable
values of displacement vectors from creating false peaks.

Figure 1.a: Intensity images used in the first experiment
(the white lines only emphasize the contours of the objects
and are not part of the images);

-  object A is the background,

-  object B is the large circle in the center of the image,

-  object C is the small circle in the upper left corner.

Figure 1.b: Intensity images used in the second experiment;

-  object A is the background,

-  object B is the circle in the upper right corner,

-  object C is the circle which partially occludes object B,

-  object D is the circle in the left part of the image,

-  object E is the small circle in the lower part.

In practice, we assume a limited velocity of objects,
i.e. minimal and maximal values for each parameter. The
corresponding intervals are quantized and thus each
parameter is represented by a discrete set of values. The
parameter space is the Cartesian product of these sets.

For each displacement vector $(D_1(i,j), D_2(i,j))$ and
each pair $(\mathit{rot}, \mathit{expan})$ of rotation and expansion, there
exists exactly one pair of translations $(tr_1^*, tr_2^*)$ which
satisfies equations (2.4) and (2.5). If $tr_1^*$ and $tr_2^*$ are
within the limits of the respective dimensions of the
parameter space, then we can find exactly one pair $(tr_1, tr_2)$
such that $tr_1$ and $tr_2$ are sampled values and

(3.2)  $tr_1 - res/2 \le tr_1^* < tr_1 + res/2$

(3.3)  $tr_2 - res/2 \le tr_2^* < tr_2 + res/2$

where $res$ is the resolution of the translation variables in
the parameter space. In this case we say that
$(D_1(i,j), D_2(i,j))$ votes for $(\mathit{rot}, \mathit{expan}, tr_1, tr_2)$, i.e., it
contributes its weight to $H(\mathit{rot}, \mathit{expan}, tr_1, tr_2)$.

Finally, among the points whose Hough transform
exceeds a given threshold in the parameter space, local
maxima are found. These represent the hypothetical
motions of objects in the image.
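The voting process described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the tuple layout of the input, the use of a dictionary as a sparse accumulator, and the single shared range for both translation axes are all choices made here for brevity.

```python
import math
from collections import defaultdict


def hough_votes(vectors, rots, expans, tr_min, tr_max, res):
    """Accumulate weighted votes in a sparse 4-D parameter space.

    vectors: iterable of (i, j, d1, d2, w), a displacement (d1, d2) with
             weight w at pixel (i, j).
    rots, expans: the sampled rotation and expansion values.
    Translations are sampled as tr_min + k * res for integer k.
    Keys of the result are index tuples (rot_idx, expan_idx, k1, k2).
    """
    acc = defaultdict(float)
    for i, j, d1, d2, w in vectors:
        for r_idx, rot in enumerate(rots):
            c, s = math.cos(rot), math.sin(rot)
            for e_idx, expan in enumerate(expans):
                scale = 1.0 + expan
                # Solve (2.4) and (2.5) for the unique translation pair.
                tr1 = i + d1 - scale * (c * i - s * j)
                tr2 = j + d2 - scale * (s * i + c * j)
                if not (tr_min <= tr1 <= tr_max and tr_min <= tr2 <= tr_max):
                    continue  # outside the parameter space: no vote
                # Sampled value whose half-open cell contains tr*, (3.2)-(3.3).
                k1 = math.floor((tr1 - tr_min) / res + 0.5)
                k2 = math.floor((tr2 - tr_min) / res + 0.5)
                acc[(r_idx, e_idx, k1, k2)] += w
    return acc


# Nine pixels that all moved by (1, -1): a pure translation.
field = [(i, j, 1.0, -1.0, 1.0) for i in (-10, 0, 10) for j in (-10, 0, 10)]
acc = hough_votes(field, [-0.1, 0.0, 0.1], [-0.1, 0.0, 0.1], -2.0, 2.0, 0.5)
best = max(acc, key=acc.get)
print(best, acc[best])
```

In this toy field the strongest cell is the one for zero rotation, zero expansion and translation (1, -1); the thresholding and local-maximum search of the last step are left out.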


3.2  Reliability and Efficiency Considerations

The resolution of the parameter space should be
determined by a few considerations: the signal to noise ratio
(SNR), the required accuracy, the computation time and the
required storage space.

For each independently moving object in the scene,
the SNR in the parameter space can be defined as:

(3.4)  $\mathrm{SNR} = \dfrac{\text{no. of votes for the object motion}}{\text{average no. of votes for each parameter value}}$

(for a different definition see [BRO82]). If the SNR is low,
then false peaks, higher than the value corresponding to the
object, can be created. Thus the detection of the object's
motion, by a straightforward Hough technique, may be
difficult or impossible.

Let us assume that the multiplicative parameters of
rotation and expansion are quantized into $p_1$ elements
each and that the translation parameters are quantized into
$p_2$ elements each. Then the parameter space includes $p_1^2 p_2^2$
elements. Let $n$ be the number of displacement vectors
which are considered in the computation of the Hough
transform. According to the voting process described in
section 3.1, for each displacement vector and each pair
$(\mathit{rot}, \mathit{expan})$, there exists at most one pair $(tr_1, tr_2)$ of
translations such that the displacement vector votes for
$(\mathit{rot}, \mathit{expan}, tr_1, tr_2)$. Assuming that $tr_1$ and $tr_2$ are likely to
be within the limits of the respective dimensions (and that
is the case in our experiments) we can estimate the
average number of votes for each parameter value as
$n p_1^2 / (p_1^2 p_2^2) = n / p_2^2$. If $c$ represents the fraction of the
image covered by an object, then it contains $cn$
displacement vectors, where $0 < c \le 1$. Thus, we can estimate
the SNR by:

(3.5)  $\mathrm{SNR} = \dfrac{cn}{n / p_2^2} = c\,p_2^2$

If, for reliable detection of the object, the SNR should
be larger than some threshold $t$, then $p_2$ should satisfy the
constraint $p_2 \ge \sqrt{t/c}$. For example, if $t = 10$ and $c = 0.01$
then $p_2$ should be at least 32. This observation also
indicates that for a given $p_2$, the motion transformation of
a small object may be difficult to detect.

The second consideration is the required accuracy.
The parameter resolution can be dynamically modified to fit
the expected constraints of the task domain. If, for
example, the maximal value of rotation is 1/8 radian, the
minimal value is -1/8 radian and the required resolution is
1/128 radian, then $p_1$ should be at least 32.

The third consideration is computation time.
Computationally, the most expensive process is the voting
process. We saw in section 3.1 that the basic operation in
this process is computing $tr_1$, $tr_2$ for each displacement
vector and each pair $(\mathit{rot}, \mathit{expan})$. Therefore, the voting
process takes approximately $s n p_1^2$ time units, where each
basic operation takes $s$ time units.

The fourth consideration is the required memory for
the parameter space, which includes $p_1^2 p_2^2$ elements. If we
combine the requirements for high SNR and high accuracy
we may have

(3.6)  $p_1^2 p_2^2 \ge 32^4 > 1000000$

In such a large array, finding local maxima also becomes a
computationally expensive task.

Finally, assuming that we want to obtain a given
accuracy and we are given a storage space with a fixed
size, the optimal values of $p_1$ and $p_2$ can be determined.
Let us suppose that the image contains $m^2$ pixels and that
the origin of the coordinate system is in the center of the
image. Then, using a resolution of $\epsilon_1$ in the multiplicative
parameters of rotation and expansion can cause, at the
boundary of the image, a displacement error of $m\epsilon_1/2$.
Therefore, it is reasonable to quantize the parameter space
in such a way that $m\epsilon_1 = \epsilon_2$, where $\epsilon_2$ is the resolution of
translation. Consequently, if $m$ is multiplied by 4, for
example, then $p_1$ should be multiplied by 2 and $p_2$ should
be divided by 2.

Figure 2: Samples of the displacement fields. (a) First
experiment. (b) Second experiment.

Figure 3: Weight planes. Note the correspondence between
low values (represented by darker gray levels) in the weight
planes and incorrect values of displacement vectors in the
boundaries of the objects. (a) First experiment. (b) Second
experiment.

4.  Computing Motion Parameters from Displacement Field
Information

4.1  Key Ideas

The proposed method is intended to reduce the
problems mentioned in the last section and to test
mechanisms for solving such problems for even more
difficult tasks, e.g. recovering the motion parameters of 3-D
motion with six degrees of freedom.

The key ideas which are used for accomplishing this
goal are the following:

1)  Given a large displacement field (such as the
128x128 array in the experiments), we can compute the
motion parameters of large objects by using a coarse
resolution field. Such a field can be obtained by
uniformly sampling the initial field. In this way, we can
considerably reduce the computation time.

2)  We can find the motion parameters of a given
small object by locating a window around the object and
applying the Hough transform only to the displacement
vectors contained in this window. Such a window can be
located by using a multipass approach (see next section).
By focusing our attention on the window, we increase the
proportion of the vectors contained in the object, i.e., we
increase $c$ in equation (3.5). We can now decrease $p_2$ and
still find the motion parameters of the object. This
technique enables us to detect small objects and save time
and storage space.

3)  Even with a limited memory size, we can find
accurate parameter values by iteratively using the Hough
technique. In each iteration we quantize the parameter
space around the values estimated in the previous iteration,
using a finer resolution. Other methods for reducing the
required space in Hough techniques can be found in
[ORO81, SLO81].
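The trade-off behind key idea 2 can be made concrete with the SNR estimate of equation (3.5). The sketch below is illustrative: the function name is hypothetical, and the 32x32 window is an assumed example size, not one taken from the experiments.

```python
import math


def min_translation_bins(t, c):
    """Smallest integer p2 satisfying p2 >= sqrt(t / c), i.e. c * p2**2 >= t.

    t: SNR threshold needed for reliable detection.
    c: fraction of the considered displacement vectors that belong to
       the object (0 < c <= 1).
    """
    if not (0.0 < c <= 1.0):
        raise ValueError("c must lie in (0, 1]")
    return math.ceil(math.sqrt(t / c))


# The example from section 3.2: an object covering 1% of the image.
print(min_translation_bins(10, 0.01))

# The same object seen through a 32x32 window of a 128x128 field covers
# 0.01 * 128**2 / 32**2 = 16% of the window.
print(min_translation_bins(10, 0.16))
```

Restricting the transform to the window raises c from 0.01 to 0.16, so the required number of translation bins per axis drops from 32 to 8, which is the saving in time and storage that the windowing is meant to provide.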

4.2  Description 

4.2.1  General 

The  algorithm  is  based  on  repeatedly  executing  a 
basic  cycle  of  operations.  The  input  to  each  cycle  includes: 

1)  A  list  L  of  motion  transformations  which  were 
computed  in  previous  cycles  (initially  L  is  an  empty  list). 

2)  A  binary  mask  array  A  in  registration  with  the 
displacement  field.  Each  element  in  A  is  either  1  or  0:  1 
if  the  corresponding  displacement  vector  is  consistent  with 
one  of  the  already  computed  motion  transformations;  0 
otherwise.  Initially  all  the  entries  in  this  array  are  set  to  0. 


Each cycle is composed of the following steps:

1)  Locate windows in the image which contain
relatively dense clusters of 0-entries in A. Initially there is
one window consisting of the whole image.

2)  For each window compute the Hough transform
and hypothesize (see section 4.2.3) the motion
transformations.

3)  Test, sequentially, the hypothesized transformations.
The test is done by considering the 0-entries in A that are
contained in the window, and summing the weights of the
associated displacement vectors which are consistent with the
hypothesized transformation. If the sum is higher than a
given threshold, the transformation is confirmed. In this
case it is added to the list L and the array A is updated
correspondingly.

4.2.2  Locating  Windows 

A window can be described as a set
$\{(i,j) : i_0 \le i < i_1,\ j_0 \le j < j_1\}$. For implementation
reasons, we consider only windows such that

$i_0,\ i_1,\ j_0,\ j_1 \in \{0, 4, 8, \ldots, 128\}$

and

$i_1 - i_0,\ j_1 - j_0 \in \{8, 16, 32, 64\}$.

For each window, we define a criterion function CR by:

(4.1)  $CR = \dfrac{\text{no. of 0-entries of } A \text{ in the window}}{\sqrt{\text{area of the window}}}$

We look for windows with high values of CR. Such
windows contain dense clusters of 0-entries in A. The use
of the square root in the denominator of equation (4.1) means
that this density can be lower as the window becomes
larger. If we eliminated the square root in this
equation, then windows that are too small, containing only
0-entries, would be chosen. If we do not find any
appropriate windows, i.e. windows whose criterion function
exceeds a given threshold, the process is stopped, since
there are probably no more objects whose
motion transformation has not already been found.
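
The criterion of equation (4.1) can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the experiments: the names `criterion` and `best_window`, the exhaustive scan, and the representation of A as a list of lists are assumptions made here.

```python
import math


def criterion(mask, i0, i1, j0, j1):
    """CR of the window {(i, j): i0 <= i < i1, j0 <= j < j1}.

    CR = (number of 0-entries of the mask in the window) / sqrt(window area).
    """
    zeros = sum(1 for i in range(i0, i1) for j in range(j0, j1) if mask[i][j] == 0)
    area = (i1 - i0) * (j1 - j0)
    return zeros / math.sqrt(area)


def best_window(mask, sizes=(8, 16, 32, 64), step=4):
    """Scan all windows with side lengths in `sizes` and corners on a grid of
    spacing `step` (an assumed exhaustive search); return (CR, window)."""
    n_rows, n_cols = len(mask), len(mask[0])
    best = (0.0, None)
    for height in sizes:
        for width in sizes:
            for i0 in range(0, n_rows - height + 1, step):
                for j0 in range(0, n_cols - width + 1, step):
                    cr = criterion(mask, i0, i0 + height, j0, j0 + width)
                    if cr > best[0]:
                        best = (cr, (i0, i0 + height, j0, j0 + width))
    return best
```

For a mask whose only 0-entries form an 8x8 block, the exact 8x8 window scores 64/8 = 8, while a 16x16 window around it scores only 64/16 = 4, so the tight window wins. With plain density (no square root) every window lying entirely inside a cluster of 0-entries would tie at 1 regardless of its size, which is the degenerate behaviour the square root avoids.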

Figure  4  shows  the  windows  found  in  the  second 
cycle  when  the  method  is  applied  to  the  images  in  figure  1. 
Figure  5  shows  the  A  arrays  when  the  processes  are 
stopped.  The  black  areas  in  figure  5,  which  represent  the 
0-entries  in  the  A  arrays,  correspond  to  incorrect  values  of 
displacement  vectors  in  the  boundaries  of  the  objects. 

4.2.3  The  Hypothesizing  Phase

Before we start the voting process of the displacement
vectors in a given window, we have to decide which
vectors take part in this process and how the parameter
space is defined. If the window contains no more than
1024 elements, then all of them take part in the voting
process. Otherwise, for efficiency considerations (see section
4.1), we utilize a uniformly sampled subset of 1024
elements. For example, if the window is 32x64 elements, we
define a sub-array of 32x32 elements by choosing all the
elements (i,j) such that j is even.
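
One way to realize this subsampling is sketched below. The text gives only the 32x64 example, so the rule used here (repeatedly doubling the stride along the longer side until at most 1024 elements remain) is an assumption, as are the names `sample_strides` and `sampled_indices`.

```python
def sample_strides(height, width, limit=1024):
    """Return strides (si, sj) so that the sampled sub-array has at most
    `limit` elements. Assumed rule: double the stride of the longer side."""
    si, sj = 1, 1
    while (height // si) * (width // sj) > limit:
        if width // sj >= height // si:
            sj *= 2
        else:
            si *= 2
    return si, sj


def sampled_indices(i0, i1, j0, j1, limit=1024):
    """List the (i, j) positions of the window that take part in the voting."""
    si, sj = sample_strides(i1 - i0, j1 - j0, limit)
    return [(i, j) for i in range(i0, i1, si) for j in range(j0, j1, sj)]
```

For a 32x64 window starting at an even column this reproduces the example above: every row is kept and only the even columns, giving a 32x32 sub-array.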


The parameter space is an adjustable 4-D array
("adjustable" means that the number of elements in each
dimension is not fixed) which contains 17^4 (≈ 90000)
elements. We assume that the rotation is limited to 0.125
radians in each direction, the expansion (or contraction) is
limited to 0.125 and the translation is limited to 8 pixels in
each direction. The axes which correspond to rotation and
expansion contain p1 elements each and the axes which
correspond to the translations contain p2 elements each. If
the length of the window is at least 64 elements then,
according to the argument described at the end of section
3.2 for equalizing the effective parameter resolutions, we
choose p1 = p2 = 17; otherwise p1 is decreased and p2 is
increased. So, for example, if the window is 16x16, we
choose p1=9 and p2=31.
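
The arithmetic behind these choices can be checked with a short sketch. It assumes that each axis is sampled at equally spaced values covering the full range from minus the limit to plus the limit, endpoints included; the text does not state this, and the names `axis_step` and `describe` are introduced here for illustration only.

```python
def axis_step(limit, count):
    """Spacing of `count` equally spaced samples covering [-limit, +limit]
    (assumed quantization; both endpoints are sampled)."""
    return 2.0 * limit / (count - 1)


def describe(p1, p2, rot_limit=0.125, exp_limit=0.125, trans_limit=8.0):
    """Size of the 4-D accumulator and the resolution of each kind of axis."""
    return {
        "cells": p1 * p1 * p2 * p2,
        "rotation_step": axis_step(rot_limit, p1),
        "expansion_step": axis_step(exp_limit, p1),
        "translation_step": axis_step(trans_limit, p2),
    }
```

Under this assumption p1 = p2 = 17 gives 17^4 = 83521 cells with a rotation step of 0.015625 radians and a translation step of 1 pixel, while p1 = 9, p2 = 31 gives 9^2 x 31^2 = 77841 cells with a coarser rotation step (0.03125) and a finer translation step (about 0.53 pixel). Both sizes stay close to the quoted 90000 elements.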


Figure  4:  Optimal  windows  found  in  the  A  arrays  during 
the  second  cycle  of  the  experiments,  (a)  First  experiment 
(b)  Second  experiment. 


After the voting process is finished, local maxima in
the Hough transform are determined. From these
candidates, the ones that exceed a given threshold are
selected. The threshold is proportional to the number of
all the voting displacement vectors. If the resolution of the
translation axes is coarser than 1/2 pixel, we define a new
parameter space, around each maximum point, with finer
resolution. We then recompute the Hough transform. The
process is repeated until we achieve a resolution of 1/2
pixel at most. At the end of this process we have a set
of hypothesized transformations.




Figure  5:  Final  A  arrays,  (a)  First  experiment,  (b)  Second 
Experiment 


4.2.4  The  Testing  Phase

In this phase we sequentially test the hypothesized
transformations in the order of their Hough transform
values in the parameter space. We test only the
transformations which are still not included in the list L of
confirmed transformations. When we test a given
transformation, we check all the displacement vectors with
an associated 0-entry in the corresponding window. We sum
the weights of those vectors which are consistent with the
hypothesized transformation. We also compute a threshold
proportional to the number of 0-entries in A contained in
the window. If the sum is higher than the threshold, we
accept the transformation, add it to the list L and update
the array A. In the current implementation, the process is
stopped if we do not accept any transformation in any of
the windows.
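
The test of a single hypothesis can be sketched as follows. Several details are assumptions made for illustration and are not taken from the paper: the motion model (a point p is mapped to (1 + expansion) R(rotation) p + translation about the coordinate origin), the consistency tolerance `tol`, the proportionality constant `fraction`, and the dictionary representation of the displacement field and of the mask A.

```python
import math


def predicted_displacement(x, y, rot, exp, tx, ty):
    """Displacement of (x, y) under the assumed model
    p -> (1 + exp) * R(rot) * p + (tx, ty)."""
    c, s = math.cos(rot), math.sin(rot)
    k = 1.0 + exp
    return (k * (c * x - s * y) + tx - x, k * (s * x + c * y) + ty - y)


def confirm_hypothesis(field, mask, window, params, tol=0.5, fraction=0.5):
    """field[(x, y)] = (dx, dy, weight); mask[(x, y)] is 0 or 1.

    Sums the weights of the still unexplained vectors (mask 0) in the window
    that agree with `params` = (rot, exp, tx, ty). If the sum exceeds
    fraction * (number of 0-entries in the window), the mask is updated and
    True is returned; otherwise the mask is left untouched.
    """
    x0, x1, y0, y1 = window
    unexplained = [(x, y) for (x, y) in field
                   if x0 <= x < x1 and y0 <= y < y1 and mask[(x, y)] == 0]
    consistent, total = [], 0.0
    for (x, y) in unexplained:
        dx, dy, weight = field[(x, y)]
        px, py = predicted_displacement(x, y, *params)
        if math.hypot(dx - px, dy - py) <= tol:
            consistent.append((x, y))
            total += weight
    if total > fraction * len(unexplained):
        for key in consistent:
            mask[key] = 1
        return True
    return False
```

A confirmed transformation would then be appended to the list L by the caller; vectors that remain 0 in the mask are the input to the next cycle.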

5.  Experiments 

We  performed  two  experiments  based  on  two  pairs  of 
128x128  artificial  images  (figure  1).  In  the  experiments  the 
objects  were  transformed  according  to  the  upper  values  in 
each  entry  in  tables  5.1  and  5.2.  The  lower  numbers  in 
these  entries  are  the  computed  parameters. 


|          | rotation  | expansion | vertical    | horizontal  |
|          | (radians) |           | translation | translation |
|          |           |           | (pixels)    | (pixels)    |

| object A |           |           |             |             |
| actual   | 0.        | 0.        | 0.          | 0.          |
| computed | 0.        | 0.        | 0.          | 0.          |

| object B |           |           |             |             |
| actual   | 0.07      | 0.        | 0.          | 0.          |
| computed | 0.0781    | 0.        | 0.          | 0.          |

| object C |           |           |             |             |
| actual   | 0.        | 0.        | 0.          | 2.          |
| computed | 0.        | 0.        | 0.          | 2.          |

table  5.1  -  first  experiment


In  the  first  stage  -  the  displacement  field  was 
computed  (figure  2).  Note  the  errors  at  the  boundaries  of 
the  objects  which  correspond  to  low  values  in  the  weight 
planes  (figure  3). 

In  experiment  1,  during  the  first  cycle  of  the 
algorithm  for  computing  the  motion  parameters,  the  motion 
transformations  of  objects  A  and  B  were  detected  (see  the 
results  in  table  5.1).  In  the  second  cycle,  two  windows  were 
located  around  areas  in  the  mask  array  A  with  relatively 
dense  clusters  of  0-entries  (figure  4),  but  only  the  window 
around  object  C  gave  a  positive  result  -  the  motion 
transformation  of  object  C.  In  the  final  cycle  no  appropriate 
windows  were  found  in  the  array  A  (figure  5). 
Corresponding  results  from  experiment  2  are  also
shown  in  figure  4  and  figure  5.  The  computed 
transformations  are  shown  in  table  5.2. 


|          | rotation  | expansion  | vertical    | horizontal  |
|          | (radians) |            | translation | translation |
|          |           |            | (pixels)    | (pixels)    |

| object A |           |            |             |             |
| actual   | 0.025     | 0.         | 0.          | 0.          |
| computed | 0.0234    | 0.         | 0.          | 0.          |

| object B |           |            |             |             |
| actual   | 0.        | 0.1        | -1.5        | 0.          |
| computed | 0.        | 0.09375    | -1.2        | 0.          |

| object C |           |            |             |             |
| actual   | -0.1      | 0.         | 0.          | 2.2         |
| computed | -0.09375  | 0.         | -0.2        | 2.1         |

| object D |           |            |             |             |
| actual   | 0.12      | -0.1       | 0.          | 0.          |
| computed | 0.125     | -0.0937    | -0.2        | 0.          |

| object E |           |            |             |             |
| actual   | 0.        | 0.125      | -1.1        | 0.7         |
| computed | 0.        | 0.0625 (*) | -1.05       | 0.6         |


table  5.2  -  second  experiment 

(*)  The  large  error  indicated  in  this  entry  is  due  to  the 
small  size  of  object  E  (radius  =  8  pixels)  which  reduces  the 
possible  resolution  in  the  measurements  of  rotation  and 
expansion. 

6.  Conclusions  and  Extensions 

This  work  demonstrates  an  efficient  and  robust 
algorithm,  based  on  the  Hough  technique,  for  recovering 
motion  parameters  in  scenes  containing  several  independently 
moving objects. A hierarchical approach, combined with a
windowing  scheme,  is  implemented  in  order  to  deal  with 
objects  of  different  size.  The  storage  space  and 
computation  time  can  be  limited,  while  still  computing  the 
motion  parameters  very  accurately  and  distinguishing 
between  real  objects  and  noise  effects. 

We hope to extend this work to sequences of images
and to recovering the 3-D motion parameters of rigid
objects. However, the latter task is much more difficult
than recovering 2-D motion parameters. In the 2-D case
each vector contributes two constraints (equations 2.4 and
2.5) whereas in the 3-D case, assuming that depth
information is unknown, each vector contributes only one
constraint. Therefore, the signal to noise ratio in the
parameter space (section 3.2) is much lower. In addition,
we expect to have problems of ambiguity in the
interpretation of noisy displacement fields, where a group of
motion transformations can be equally consistent with the
data. In such cases a probabilistic approach might be more
suitable. We also plan to implement less restricted methods
for computing a displacement field or other equivalent
information, to use the motion information for
object-surround separation, and to test the method in real
scenes.
ABSTRACT

A unifying abstract mathematical structure is presented
for a number of vision problems, notably stereo, motion stereo,
optic flow, and matching. Ideas from modern differential topology
are presented and applied to the general matching problem,
a common approach to stereo matching, defined as follows.
Given 2 picture functions $F_1, F_2 : M^2 \to R^n$, one finds regions
$K_1, K_2 \subset M^2$ and a 1-1 matching function $g_\pi : K_1 \to K_2$ such
that $F_1 = F_2 \circ g_\pi$. It is shown that generically for monochrome
pictures ($n = 1$) there is a large infinity of solutions, but for 2
or more color dimensions ($n \ge 2$) the solution is unique.

The paper is offered partially in the hope of introducing
vision  workers  to  this  type  of  mathematics  and  persuading  them 
of  its  utility. 

I  INTRODUCTION

Analogously to the Erlanger Programm, the task of computer
vision can be viewed as finding invariants of irradiance
functions under the rigid motion group of $R^3$. This paper
describes this structure in the language of modern abstract
mathematics, providing a framework for understanding and the
possibility of applying powerful methods to resolve fundamental
questions. We prove a theorem which says that occlusion-free
stereo matching requires at least 2 color dimensions or some
knowledge of the imaging geometry.

For reasons of space, the treatment here is abridged and
terse. The interested reader can find a more complete exposition,
including mathematical details, definitions, and wider discussion
in [Blicher 1982].

II  THE  MATHEMATICAL  STRUCTURE

The structure is depicted by the commutative diagram
Fig. (*'). The object surface $E$ is embedded in $R^3$ via $i$. $M^3$
is a fixed 3 dimensional subset of $R^3$, and is the domain of
definition for the imaging projection $\pi$, which maps it to $M^2$, the
2 dimensional image space. $F_1$ is the observed image intensity
on some closed set $K_1$ of the image plane. $S_1$ and $K_1$ are
corresponding visible regions of $E$ and the image space $M^2$, resp.

We assume that the surface $E$ admits a function
$F : E \to R^n$ which describes intrinsic surface features. E.g.,
for $F : E \to R^1$ (i.e. $n = 1$), $F$ represents an intrinsic surface
brightness or luminance. Thus we ignore the effect of viewpoint
on image irradiance, or equivalently, we take the reflectance
function to be constant. To the extent that we deal only with small
changes in viewpoint, that will usually be a good approximation.


•This  work  was  supported,  in  part  by  ARPA  contracts  MDA503-80-C-0102 
and  N00039-82-C  0250. 


$F$ can be thought of as the intrinsic surface property albedo; then
our analysis deals with quantities that depend only on albedo,
to good approximation. For the case $n \ge 2$, we have in mind
color images: normal human cone vision incorporates a function
$F_1 : K_1 \to R^3$ ($n = 3$). Note that we also subsume cases for a
smaller (i.e. $n = 2$) or larger ($3 < n < \infty$) number of passbands,
or in fact any surface attribute, such as a multi-dimensional
texture measure, which can be thought of as taking pointwise values
in some finite dimensional space.


[Figure: commutative diagram relating $R^3$, $M^3$, $E$, $M^2$ and $R^n$.]


The map $g : R^3 \to R^3$ is a rigid motion of $R^3$. For
many surfaces $E$, different viewpoints $g$ have different domains
of visibility of $E$. So, in general, $g_\pi$ as pictured may not be
well defined, since we cannot be sure that for $K_1$, $K_g$, $S_1$, $S_g$
as we have defined them, $S_1 \subset S_g$, or equivalently that
$\pi \circ i_g : E \to M^2$ is 1-1 on $S_1$. I.e., part of what we see in the
picture $F_1$ might be hidden from view when we look after doing
$g$. Hence the regions $K_1$, $K_g$ must be chosen in such a way that
$g_\pi$ is well-defined. For example, having chosen $S_1$, $S_g$ as above,
we can define $S'_1 = S'_g = S_1 \cap S_g$ and $K'_1 = \pi \circ i(S'_1)$ and
$K'_g = \pi \circ i_g(S'_g)$. With these restrictions, $g_\pi$ is a diffeomorphism
$K_1 \to K_g$ with the property that $F_1(p) = F_g(q) = F_g(g_\pi(p))$,
which is the same as saying that $g_\pi$ is a deformation of the
picture $F_1$ into the picture $F_g$. Note that this observation is also
equivalent to asserting that the diagram is commutative (for the
$F_1$, $g_\pi$, $F_g$ loop).

We have sidestepped the issues of occlusion, shadowing
and photometry. Nevertheless, major parts of the following
problems are subsumed in the structure we have presented.


•  Area correlation stereo

•  General matching

•  Motion stereo and optical flow

•  Feature based stereo

•  Singularity tracking

III  STEREO  BY  GENERAL  MATCHING

As a straightforward (but not trivial!) application of
the abstract viewpoint we are proposing, we show that for
monochrome images, the matching problem is insoluble, and
study the conditions which allow unique solution.


[Figure: commutative diagram with $F_1 : K_1 \to R^n$, $F_2 : K_2 \to R^n$ and $g_\pi : K_1 \to K_2$.]

Fig.  (GM)

A common approach to stereo matching is via general
matching. I.e., given 2 picture functions $F_1, F_2 : M^2 \to R^n$,
one finds regions $K_1, K_2 \subset M^2$ and a 1-1 matching function
$g_\pi : K_1 \to K_2$ such that the diagram Fig. (GM) commutes. Only
after the matching function is found is the surface embedding
computed by associating relative depth with relative disparity at
each point of (say) $K_1$. We assume that for the matching phase
no information about imaging geometry is used, in particular
that there is no assumption of rectification, and no knowledge
of epipolar geometry is used, so that arbitrary (but sufficiently
differentiable) distortions are possible. Since we are concerned
with existence, this is an idealization of what happens in practice,
where usually there is at least implicit use of some geometric
constraints. We show, in fact, that such use is necessary.

The  following  question  then  arises: 

Problem (Uniqueness of General Matching). If we seek
an arbitrary (piecewise) $C^1$ diffeomorphism $g_\pi$ to make Fig. (GM)
commute, when are we guaranteed a unique solution to the
matching problem?

E.g., if $F_1$, $F_2$ are both constant functions, i.e., we have
uniformly gray pictures, the problem is completely degenerate,
and any diffeomorphism $g_\pi$ is a solution.

We do not consider problems related to occlusion, and
instead assume that in fact it is possible to find regions $K_1$, $K_2$
and a map $g_\pi$ which fulfill the smoothness criteria and make Fig.
(GM) commute. Our concern is whether $g_\pi$ is then unique.

Theorem (2 color theorem). Stereo requires at least 2
color dimensions or 3 space dimensions, i.e., for a monochrome
picture, general matching has infinitely many solutions, but for
2 or more color dimensions, it is generally unique. Hence the
monochrome case requires knowledge of the imaging situation to
constrain the problem. (In particular, this applies to gray-level
correlation.)

More precisely, consider the commutative diagram Fig.
(GM) where $g_\pi$ is a $C^1$ diffeomorphism, $F_1$, $F_2$ are $C^1$, and
$K_1$, $K_2$ are differentiable submanifolds of $R^2$.

If $n = 1$ (i.e. the picture is monochrome), then $\exists$ an
infinite dimensional family of $C^1$ diffeomorphisms $\{h_\pi\}$ such that
replacing $g_\pi$ by $g_\pi \circ h_\pi$ also results in a commutative diagram
(i.e. $g_\pi \circ h_\pi$ is a solution). The family is parametrized by (at
least) the continuous functions $K_1 \to R$, and contains an isomorph
of a neighborhood of the identity.

If $n = 2$ (i.e. the picture has 2 color dimensions), then
generically there will be a finite number of $g_\pi$ which make the
diagram commute (note we have assumed that such a $g_\pi$ exists).
If we take $K_1$, $K_2$ to be rectangles or discs (as in a usual picture)
then generically there is a unique $g_\pi$.

If $n \ge 3$ (i.e. the picture has at least 3 color dimensions),
then generically there will be a unique $g_\pi$ which makes
the diagram commute.

The theorem follows easily from some facts in differential
topology. (An excellent introduction to the subject is [Guillemin
1974], and [Hirsch 1976] is a good reference.)

Proof (case $n = 1$: monochrome pictures). The idea
of the proof is very simple; the difficulty lies in establishing
when it is valid. The idea is this. Observe that if
$F_2 \circ g_\pi(p) = F_1(p)$ (i.e. if Fig. (GM) is a commutative diagram) then
$g_\pi(F_1^{-1}(x)) = F_2^{-1}(x)$, i.e. $g_\pi$ takes contour lines to contour
lines. Conversely, any diffeomorphism $h : K_1 \to K_2$ which takes
contour lines of $F_1$ to contour lines of $F_2$ satisfies the conditions
for $g_\pi$. Thus any $g_\pi$ taking contour lines to contour lines will
solve our local matching problem. But how many such $g_\pi$'s can
there be? Assume for the moment that a typical contour map
contains a diffeomorphic image of the fragment represented by
the solid lines in Fig. (frag).

If $\psi : K_1 \to K_1$ is a diffeomorphism leaving contours of
$F_1$ invariant, then if $g_\pi$ is a matching function so is $g_\pi \circ \psi$. Define
$\psi$ as follows. As you go along the dotted line

$\gamma : I \to K_1$,
$t \mapsto \gamma(t)$

in Fig. (frag), slide each contour along itself by an angle $\theta(t)$. As
long as $\theta : I \to R$ is a diffeomorphism onto its image, the map
$\psi$ will be a diffeomorphism in a neighborhood of the dotted line.
To the extent that this picture is valid, there will be as many
matchings $g_\pi \circ \psi$ as there are such maps $\theta$.
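
A concrete instance of this sliding construction can be checked numerically. The sketch below is only an illustration under simplifying assumptions chosen here: the picture is $x^2 + y^2$, so that its contour lines are circles about the origin, both pictures are taken to be the same, and the names `picture` and `twist` are hypothetical. Rotating each circle by an angle that depends on its radius leaves the picture unchanged, so every choice of the angle function gives another match.

```python
import math


def picture(x, y):
    """Monochrome picture whose contour lines are circles about the origin."""
    return x * x + y * y


def twist(theta):
    """Return the map that slides each circular contour along itself by the
    angle theta(radius)."""
    def psi(x, y):
        angle = theta(math.hypot(x, y))
        c, s = math.cos(angle), math.sin(angle)
        return (c * x - s * y, s * x + c * y)
    return psi


# Two different angle functions give two different maps, both of which
# preserve the picture: picture(psi(x, y)) == picture(x, y) up to rounding.
psi_a = twist(lambda r: 0.3 * r)
psi_b = twist(lambda r: 1.0 + r * r)
```

Composing any matching function with `psi_a` or `psi_b` therefore yields a different matching function for the same pair of monochrome pictures.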


Fig.  (frag) 

Actually, we are going to use a slightly more general (and
technical) method to construct a family of diffeomorphisms $\varphi_\rho$,
roughly in 1-1 correspondence with the set of all $C^r$ functions
$K_1 \to R$. For this we will use a canonical vector field defined
along the contour lines of $F_1$, which will tell us how much to
slide each contour line. Define a new vector field on $K_1$ by
rotating each of the local vectors of $\nabla F_1$ by +90°, i.e. 90°
counterclockwise (which is uniquely defined because we have a
globally defined inner product on an orientable manifold). One
might, e.g., define the new vector field $Z$ on $K_1$ by $Z(p) = (-b, a)$
if $\nabla F_1(p) = (a, b)$. Note that $Z \cdot \nabla F_1 = 0$ at all $p$. Since
smoothness is defined with respect to coordinates, $Z$ has the same
degree of smoothness as $\nabla F_1$. Furthermore, wherever $Z \ne 0$, it
is tangent to the contour lines of $F_1$, so that the orbits of $Z$ are
exactly those contour lines, and the critical points are exactly
the critical points of $F_1$. We now consider the flow generated
by the vector field $Z$.
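
The behaviour of this rotated gradient field can be illustrated numerically. The following sketch is not part of the original argument: the example picture, the central-difference gradient, and the Runge-Kutta integrator are all assumptions chosen for illustration. It checks that the field $(-b, a)$ is orthogonal to the gradient $(a, b)$ and that following its flow moves a point along a contour line, i.e. leaves the picture value essentially unchanged.

```python
import math


def example_picture(x, y):
    """A smooth monochrome picture (arbitrary example)."""
    return x * x + 2.0 * y * y + 0.3 * x


def gradient(f, x, y, h=1e-5):
    """Central-difference estimate of the gradient (a, b) of f at (x, y)."""
    return ((f(x + h, y) - f(x - h, y)) / (2.0 * h),
            (f(x, y + h) - f(x, y - h)) / (2.0 * h))


def rotated_gradient(f, x, y):
    """The field Z = (-b, a): the gradient rotated by +90 degrees."""
    a, b = gradient(f, x, y)
    return (-b, a)


def flow(f, x, y, time=1.0, steps=1000):
    """Follow the flow of Z for the given time with classical RK4 steps."""
    dt = time / steps
    for _ in range(steps):
        k1 = rotated_gradient(f, x, y)
        k2 = rotated_gradient(f, x + 0.5 * dt * k1[0], y + 0.5 * dt * k1[1])
        k3 = rotated_gradient(f, x + 0.5 * dt * k2[0], y + 0.5 * dt * k2[1])
        k4 = rotated_gradient(f, x + dt * k3[0], y + dt * k3[1])
        x += dt * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        y += dt * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
    return x, y
```

Calling `flow(example_picture, 1.0, 0.5)` returns a point well away from the start at which `example_picture` takes the same value up to integration error, which is the numerical counterpart of the statement that the orbits are the contour lines.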

Near the boundary of $K_1$, the time-one map of this flow
may not be defined if a contour line has a boundary. However,
this is easily overcome by using a "bump" function [Abraham
1978] $\beta : K_1 \to R$ to get a vector field $\beta \cdot Z$ on $K_1$ which
smoothly goes to zero very close to the boundary, and hence has
a flow $\varphi_t$ which never leaves $K_1$.

Thus for each $t$, $\varphi_t : K_1 \to K_1$ is a diffeomorphism on $K_1$
leaving contour lines invariant. This family of diffeomorphisms
can be enlarged even more. Notice that multiplying the vector
field $Z$ by a scalar $C^r$ function $\rho : K_1 \to R$ does not alter orbits.
Therefore we can enlarge the class of diffeomorphisms $\varphi_t$ by
taking all diffeomorphisms $\varphi_{t,\rho}$ given by the flows of $\rho \cdot \beta \cdot Z$ on
$K_1$. Observe that for any constant $a$, $\varphi_{at,\rho} = \varphi_{t,a\rho}$, so if $\rho$ is a
constant function, $\varphi_{1,\rho} = \varphi_{\rho,1} = \varphi_\rho$. Thus $\{\varphi_{t,\rho}\} = \{\varphi_{1,\rho}\}$,
so by abuse of notation we will write $\varphi_\rho$ for $\varphi_{1,\rho}$. QED ($n = 1$).

A.  Discussion  of  what  we  have  shown  so  far 

In the monochrome matching of 2 regions free of occlusions,
the match is far from unique. In fact there are essentially
as many matches as $C^r$ functions from such a region to the reals.
This stems directly from the fact that the iso-brightness loci
constitute connected differentiable 1-dimensional objects. That in
turn is a consequence of the fact that the picture is a map from
a 2-dimensional object to a 1-dimensional object.

Away from critical points, the matching diffeomorphisms
can differ greatly: contour lines can be slid along themselves
arbitrarily large amounts. From a practical point of view, given 2
pictures and a matching function, it is a simple matter to choose
a $\rho$ and compute $\varphi_\rho$, the time-one map of $\rho \cdot \beta \cdot Z$, giving a new
match. A matching strategy based on this analysis would first
match critical points (generically a discrete combinatorial problem),
and then contour lines intersecting the gradients through
the critical points. Of course, an actual program would also have
to deal with noise, digitization, occlusion, and variation of image
irradiance with viewing position; and it would have additional
constraints available.

B.  The  validity  of  the  intuitive  picture

Although the theorem is proved for $n = 1$, it's not clear
how the intuitive idea of the proof is related to the technical
method we actually used. To illuminate that and to disseminate
some interesting facts about such mappings, we now turn to the
validity of the picture (Fig. (frag)) we presented earlier for the
structure of the contour lines. This will require some basic results
from differential topology. This is interesting beyond the confines
of our present problem; e.g. it casts light on the structure of zero
crossings.

First, the proof given above fits into the intuitive scheme
presented earlier for using Fig. (frag), since $\rho \beta |Z|$ is essentially
the rotation function $\theta$ we discussed earlier. But is Fig. (frag)
a reasonable picture for the contour lines of a picture function?
The following propositions are consequences of the implicit function
theorem, Milnor's theorem on 1-manifolds, Sard's theorem,
and the genericity of Morse functions (see [Blicher 1983] for
details).


1) Almost every level set of a picture is a circle or a line.

2) These 1-manifolds account for almost all of the brightness
values; the rest are extrema or saddles (critical points).

3) Typically, pictures have isolated critical points (i.e. the
critical points do not form blobs, lines, or accumulations).

Now, here's what all this means in terms of Fig. (frag).
Choose a picture at random. (Say the picture is bounded by a
rectangle II with interior V.) If it has no critical points, then
all the level sets are diffeomorphic to (disjoint unions of) line
segments (and not circles, which are the only other possibility
by Milnor's result, cited above).

Suppose  the  picture  does  have  critical  points.  Then 
“generically”  the  critical  points  are  isolated. 

First let's see what happens near such a critical point.
By Morse's Lemma [Guillemin 1974, Hirsch 1976] we know that
there is a coordinate system $(u, v)$ in a neighborhood of the
critical point $p$ such that $f = f(p) \pm u^2 \pm v^2$. The possible signs
correspond to a maximum ($--$), a minimum ($++$), or a saddle
($+-$ or $-+$). So for an extremum, it's easy to see that the level
sets are just a point surrounded by circles. For a saddle, the
level sets are the sets $u^2 - v^2 = \text{const}$, shown in Fig. (saddle).
Note that the critical point is isolated (from other critical points),
though it is not isolated as part of a level set.
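
In practice the sign pattern in Morse's Lemma can be read off from the second derivatives at the critical point, since the number of minus signs equals the number of negative eigenvalues of the Hessian. The following sketch illustrates this for the 2-dimensional case; the function name and the use of the Hessian determinant test are our choices for illustration, not part of the original text.

```python
def classify_critical_point(fxx, fxy, fyy):
    """Classify a critical point of a picture from its second derivatives.

    The Hessian [[fxx, fxy], [fxy, fyy]] has determinant fxx*fyy - fxy**2.
    A negative determinant means eigenvalues of opposite sign (saddle);
    a positive one means both share the sign of fxx (minimum or maximum);
    a zero determinant means the point is degenerate, i.e. not of Morse type.
    """
    det = fxx * fyy - fxy * fxy
    if det == 0:
        return "degenerate"
    if det < 0:
        return "saddle"
    return "minimum" if fxx > 0 else "maximum"
```

For example, $f = u^2 + v^2$ has second derivatives (2, 0, 2) and is classified as a minimum, while $f = u^2 - v^2$ has (2, 0, -2) and is classified as a saddle.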
The Morse inequalities tell us that the Euler characteristic
is related to the number and type of critical points. In our case,
if we assume that the whole region of interest lies within a single
circular level set, this means that the number of extrema must
be 1 more than the number of saddles. In Fig. (frag), for the
rotation directions of the level sets to be consistent with the
way we proved the first part of the theorem, we must assume
that one of the critical points is a maximum and the other a
minimum. But from the Morse inequalities, there must be a
saddle somewhere, too. In fact, the larger picture looks like
Fig. (dimple), and when there are two maxima (or minima, in
Australia), like Fig. (pass).
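
The count used here can be written out explicitly. The following is a sketch in our own notation, assuming a Morse function on a region $D$ diffeomorphic to a disc whose boundary is a single regular level curve; the symbol $D$ and the value $\chi(D) = 1$ for a disc are supplied here and do not appear in the original text.

```latex
\chi(D) \;=\; \#\{\text{maxima}\} \;-\; \#\{\text{saddles}\} \;+\; \#\{\text{minima}\} \;=\; 1
\qquad\Longrightarrow\qquad
\#\{\text{extrema}\} \;=\; \#\{\text{saddles}\} + 1 .
```

With one maximum and one minimum, as required for Fig. (frag), this gives exactly one saddle inside the region.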

C.  Open  dense,  usually,  generically,  almost  all,  typically 

A crucial result used above is that the Morse functions
are open dense. This allows us to restrict our attention only
to pictures whose critical points are isolated and thus to avoid
considering pathological behavior. A property shared by all
members of an open dense subset (or a countable intersection
of such subsets) is called generic, which can be thought of as
"most" (see [Blicher 1983, Hirsch 1976, Nitecki 1971, Golubitsky
1973, Guillemin 1974]). Then the (countable) conjunction of
generic properties is generic. "Generic" is a key idea in modern
differential topology.

D.  The cases $n \ge 2$

Let $f : M^m \to N^n$ be $C^r$ and regular at $p$. The analysis
is based on the fact that at a regular point, if there is enough
room in the range space, $f$ is a diffeomorphism from a neighborhood
$U$ of $p$ to $f(U)$. This is yet another version of the implicit
function theorem. The idea of enough room can be made precise
simply by requiring the Jacobian to be 1-1. This is the case for a
regular point if the dimension of the range space is at least that
of the domain space, i.e. if $m \le n$, which is the situation for us
if there are at least 2 color dimensions.

As before, the possible maps $g_\pi$ which solve the matching
problem are exactly those which take level sets to level sets.
Since the $g_\pi$ are diffeomorphisms, we can just study the maps
of the level sets of, say, $F_1$, since they are equivalent by a given
$g_\pi$ to the set of all $g_\pi$. (To see this, consider Fig. (equiv). Let
$h$ be a diffeomorphism which takes level sets to level sets, i.e.
which makes the diagram commutative, and define $g'_\pi = g_\pi \circ h$,
so that any $h$ gives us a $g'_\pi$. Likewise given such a $g'_\pi$, define
$h = g_\pi^{-1} \circ g'_\pi$.)


[Figure: commutative diagram with $h : K_1 \to K_1$ and $F_1 : K_1 \to R^n$.]

Fig.  (equiv)

First, let's look at how many points can be in $F_1^{-1}(p)$. By
the implicit function theorem, since the dimension of the range
(i.e. the color space) is at least that of the domain, the level
set of a regular value is at most a discrete set of points. Since
we are restricting ourselves to compact pictures, the level set
must be a finite set (to avoid an accumulation point). Hence
on a level set, $g_\pi$ is constrained to be one of a finite number of
permutations of the finite level set. Furthermore, since $F_1$ is a
local diffeomorphism at a regular value, the permutation cannot
jump around wildly among neighboring points, so that in fact $g_\pi$
is a permutation of "sheets."

As it turns out, the higher dimensions are easier to deal
with in our context, so we will start with them.

E.  Regular points when $n \ge 3$

Theorem. Let $M$, $N$ be embedded submanifolds of $R^n$.
Then generically, $\dim M + \dim N - n = \dim M \cap N$, where a
negative dimension means the intersection is empty.

Locally, the regular sets are embedded submanifolds (by
the inverse function theorem), so we can use the preceding to
study the inverse images of regular values. In particular, $F_1$ will
fail to be 1-1 at places where the embedded regular sets intersect.
We are interested in the case that $\dim M = \dim N = 2$,
so we see that the intersection is generically of dimension 2, 1, 0,
and empty for $n = 2, 3, 4, 5$ resp. Thus if $n \ge 3$ there is
no diffeomorphism $h$ other than the identity which makes Fig.
(equiv) commute, i.e. such that $F_1 = F_1 \circ h$. (Proof: $h$ must
be the identity wherever $F_1$ is 1-1. By the above theorem, that
is generically everywhere except on a lower dimensional submanifold.
Hence $h$ is the identity on a dense subset, and by
continuity uniquely extends to the whole space. QED.) So for
the regular points, we have disposed of all the cases of 3 or more
color dimensions. Now we look at the singular points, and their
dimension.

The genericity of Morse functions can be generalised as
follows.

Theorem (Critical set dimension). For an open dense
subset of $C^\infty(M^m, N^n)$, the set of critical points of $f$ where the
Jacobian is of rank $r$

1) comprises a submanifold of $M^m$

2) $= \emptyset$ if $(m - r)(n - r) > m$

3) is of codimension $(m - r)(n - r)$ in $M^m$ if $(m - r)(n - r) \le m$

($X$ is of codimension $k$ in $Y$ if $\dim X + k = \dim Y$.)

Before we get involved in studying the critical sets
for various color dimensions, we state 2 more closely related
theorems which allow us to immediately understand the situations
for 4 or more color dimensions. An immediate consequence
of the critical set dimension theorem is the

Theorem (Whitney Immersion Theorem). If $X, Y$ are
smooth manifolds, with $\dim Y \ge 2 \cdot \dim X$, then maps with no
singular points are open dense in $C^\infty(X, Y)$.

For a picture, $\dim X = 2$, so the above theorem applies
when there are at least 4 color dimensions. In that case, it
states that the typical picture won't have any singularities at
all. Hence, typically there is only one “sheet” and no folds.

A further result is the

Theorem (Whitney 1-1 Immersion Theorem). If $X, Y$
are smooth manifolds, with $\dim Y \ge 2 \cdot \dim X + 1$, then 1-1 maps
with no singular points are residual (i.e. generic) in $C^\infty(X, Y)$.

So with at least 5 color dimensions, we can assume no
color is used twice.

Returning to the critical set dimension theorem, in our
case $m = 2$, so what the theorem tells us is that the dimension
of the critical set is respectively 1, 0, for $n = 2, 3$, and it is empty
for $n \ge 4$.
By reasoning as we did for multiple points of the regular
set, $h$ has unique continuation to the critical set for $n \ge 3$,
yielding the conclusion that $h$ is generically unique when $n \ge 3$
(for $n = 2$ the 1-1 set need not be dense, so the conclusion
wouldn't follow).
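
The dimension counts quoted above can be checked mechanically. The following is a minimal sketch, assuming the transversality formula and the critical set dimension theorem exactly as stated in the text; the function names are illustrative and are not taken from the paper.

```python
def generic_intersection_dim(dim_m, dim_n, n):
    """Generic dimension of M ∩ N inside R^n, or None when the intersection is empty."""
    d = dim_m + dim_n - n
    return d if d >= 0 else None


def critical_set_dim(m, n, r):
    """Generic dimension of the rank-r critical set of a map M^m -> N^n, or None if empty."""
    codim = (m - r) * (n - r)
    return m - codim if codim <= m else None


def picture_critical_set_dim(n, m=2):
    """Largest generic critical set dimension for a picture (m = 2) with n color dimensions."""
    # A point is critical when the Jacobian rank drops below min(m, n).
    dims = [critical_set_dim(m, n, r) for r in range(min(m, n))]
    dims = [d for d in dims if d is not None]
    return max(dims) if dims else None


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, generic_intersection_dim(2, 2, n), picture_critical_set_dim(n))
```

For two-dimensional sheets this reproduces the intersection dimensions 2, 1, 0, empty for n = 2, 3, 4, 5, and the critical set dimensions 1, 0, empty for n = 2, 3, 4.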

To summarize, we have thus far shown that $h$ must be the
identity for $n \ge 3$, and is at worst one of a discrete set of sheet
permutations for $n = 2$. Now we will pursue the case $n = 2$ a
bit further.

F. More about $n = 2$

If we allow the support of a picture to be all of $R^2$ or $S^2$,
that is all we can say. (Consider, e.g., the function $z \mapsto z^k$ (for
some $k > 2$) on the complex plane for the picture function. Then
the sheets can be permuted leaving the picture invariant.) But
a real picture must be finite in extent, so if we are considering
subsets of the plane, a rectangle (i.e. a disc) is an appropriate
domain to consider. If we are thinking about the sphere, then
since we are restricting ourselves to occlusion-free regions, using
the entire sphere would imply that there were no observable
occlusions, which could only happen in the improbable events
that only one object was illuminated, or that the observer could
only see an object which completely enclosed him. Right now we
are only concerned with the genericity of mappings of the plane,
since we are in the context of general matching, so we will make
no claims regarding the genericity of occlusion or illumination,
though such an analysis is possible.

Let us now assume that the picture support we are
considering is topologically a disc. In that case, $h$, being a
homeomorphism, must map the boundary of the disc (a circle
$S$) to itself. If $f$ is 1-1 then $h$ must be the identity. If not, then
consider what must happen on this circle. $h$ must be continuable
along $S$, so for $p \in S$, $f^{-1}(p)$ must contain a constant number
of points. This excludes the possibility of transverse crossings of
$f(S)$. But transverse crossings for such a map are generic, so $h$
I. Introduction

In lieu of a complete segmentation approach,
we attempt a partial segmentation for natural
scenes to extract the large textured regions. The
usefulness of texture processing as an early stage
in the overall system, however, heavily depends on
the nature of the task and the image data.
Therefore, as mentioned in an earlier report [1], a
way of determining the presence or absence of
texture should precede any attempt to perform a
texture extraction operation.

A  simple  method  to  predict  texture  presence  or 
absence  using  a  pyramid  structure  is  presented.  A 
new  texture  measure  is  defined  and  used  in  a 
region-growing  extraction  scheme. 


II.  Prediction  of  Texture  Presence  and 
Uniformity  Texture  Measures 

Local  uniformity  and  the  change  in  uniformity 
at  different  resolutions  are  used  as  the  texture 
cue  in  our  extraction  scheme.  While  constructing  a 
level of the intensity pyramid, where level L is
obtained by nonoverlapped block averaging of level
L-1, the corresponding levels of the uniformity
pyramid (indicating the local uniformity at that
resolution) and of the uniformity-change (UC)
pyramid (indicating the local uniformity change from
the lowest level) are computed as shown in Fig. 1.
The underlying supposition on which these measures
are valid is that the averaging process in
constructing  a  pyramid  structure  changes  a  large 
textured  region  into  a  uniform  luminance  region  at 
the  level  where  the  averaging  window  approximately 
equals  the  size  of  the  collection  of  texture 
primitives.  This  is  true  only  if  the  variations  in 
the  illumination  and  the  primitive  size  are  small 
throughout  the  region.  On  the  assumption  that  it 
is  true,  we  can  make  the  following  conjectures: 


1. If the given image has a large portion
of textured regions, the overall
uniformity keeps increasing as the size
of the averaging window becomes larger
until it exceeds the largest primitive
size so that there is no more
improvement in uniformity.

2. If the image, on the other hand, is
devoid of texture or the textured
portion is small, the averaging process
may not improve the uniformity or may
even decrease it.

The average at each level of the uniformity pyramid
(taken over the entire image or a portion of it) is
used in determining the presence of large textured
regions and estimating the proper level of
resolution which is compatible with the size of the
texture.

Two test images (Fig. 2) are used to verify
the conjecture we made above. These two views of
the same scene taken at different seasons have a
resolution of 512*512 pixels. The average values
at different levels are shown in Table 1a and Table
1b for Fig. 2a and Fig. 2b respectively. For the
strong textural structure of the forested regions
in the first view the average value does indeed
decrease at level 3 and level 4, while it begins to
increase at level 5. From this result, we can not
only predict texture presence but also assume that
the diameters of the texture primitives lie between
8 and 16. Therefore, we can predict that the best
level for texture extraction in the pyramid is
level 3 which shows the first significant decrease
in the average values. For the second view, which
has a much weaker textural structure, the average
increases until level 4 and then decreases somewhat
at level 5 and level 6. This improvement in
uniformity, however, can not be considered as the
sign of texture presence, since texture primitives
with a diameter between 32 and 64 are very unlikely
in a 512*512 aerial image. Though more tests with
a variety of images should be made before we
confirm the validity of the conjectures, these
results show the potential of this simple method.
The level 3 uniformity and UC images of Fig. 2a are
shown in Fig. 3.


III. Extraction of Textured Regions

Previous approaches to segment an image by
texture [2,3,4] were directed to completely divide
the image into regions of uniform textural
properties without specifically separating the
textured portions from the untextured ones. Since
the untextured portions of an image can be
segmented more easily and accurately using single
pixel properties, our approach is to extract
connected textured regions one by one and leave the
untextured portion untouched for other stages of
single pixel processing.

Here, the information from a conservatively
extracted region guides the region growing in the
next lower level. The resulting region boundary is
then refined using the information from the lower
level. Therefore, three consecutive levels of the
pyramids are involved in extracting compact
textured regions. At level T+1, where level T is
the one compatible with the texture, compact
regions with high uniformity and large uniformity-change
are selected as starting elements. At level
T, one of the starting elements (magnified by the
factor of 2 to take account of the level descent)
is grown by merging neighboring pixels whose
uniformity and UC values lie inside the uniformity
and UC ranges of the magnified element. Though it
is unreliable to use the level T-1 uniformity and
UC values inside the textured region, a textured
region often adjoins untextured regions or regions
with a texture of different primitive size, which
are detectable at level T-1 (e.g. a forest region
touching rivers or roads in an aerial image). At
level T-1, therefore, boundary refining is carried
out by eliminating the untextured or differently
textured portions (with low uniformity or small UC
values) from the search area which is constructed
at level T and magnified by 2 to be compatible at
level T-1. (After the region growing stops at
level T, the search area is formed by the exterior
boundary pixels as well as boundaries of holes and
their neighboring pixels within a distance of 1.)
After one region is extracted, the process is
repeated using another starting element which is
separate from the detected regions. The starting
elements from Fig. 2a are shown in Fig. 4 and the
binary images of the resulting regions after each
step derived from the largest starting element are
shown in Fig. 5. Fig. 6 shows the boundaries of
the extracted regions on the original image (Fig.
2a). The final result on another test image (high
altitude image of the San Francisco urban area) is
shown in Fig. 7.


IV. Conclusions

The tests on two aerial images show that the
proposed technique can extract large textured
regions fairly well. Without the sophisticated
description of textures from a stochastic or
structural model, simple texture measures achieve
sufficient results in certain natural image domains.


Figure 1. 4 by 4 block at level L-1 involved in the
computation of level L features
$A(k,l)$ is the average inside the 2 by 2 block whose
lower co-ordinate is $(k,l)$, i.e.,

$$A(k,l) = \frac{1}{4} \sum_{m=k-1}^{k} \sum_{n=l-1}^{l} G_{L-1}(m,n)$$

Level L intensity: $G_L(i,j) = A(2i,2j)$

Level L uniformity:

$$U_L(i,j) = \mathrm{var}\{A(2i-1,2j-1),\ A(2i-1,2j+1),\ A(2i+1,2j-1),\ A(2i+1,2j+1)\}$$

Level L uniformity-change:

$$UC_L(i,j) = (\text{var on the level 0 values inside the whole block}) - U_L(i,j)$$
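
One pyramid step, as defined by the Figure 1 formulas, might be sketched as follows. This is an illustration only: the 0-based indexing, the edge-replication padding at the image border, and the name `next_pyramid_level` are assumptions not found in the paper, and the uniformity-change value is left out because it needs the level 0 pixels under each block.

```python
import numpy as np


def next_pyramid_level(g_prev):
    """Return (G_L, U_L) computed from the level L-1 intensity image G_{L-1}."""
    g = np.asarray(g_prev, dtype=float)
    h, w = g.shape
    if h % 2 or w % 2:
        raise ValueError("image sides must be even")

    # Assumption: replicate border pixels so every 4 by 4 window exists.
    p = np.pad(g, 1, mode="edge")

    # b[r, c] is the mean of the 2 by 2 block p[r:r+2, c:c+2] (the A(k,l) averages).
    b = (p[:-1, :-1] + p[:-1, 1:] + p[1:, :-1] + p[1:, 1:]) / 4.0

    # Level L intensity: non-overlapping 2 by 2 block averages of level L-1.
    g_next = b[1:h:2, 1:w:2]

    # Level L uniformity: variance of the four 2 by 2 block averages that
    # tile the 4 by 4 window centred on each level-L pixel.
    quads = np.stack([
        b[0:h:2, 0:w:2], b[0:h:2, 2:w + 1:2],
        b[2:h + 1:2, 0:w:2], b[2:h + 1:2, 2:w + 1:2],
    ])
    u_next = quads.var(axis=0)
    return g_next, u_next


if __name__ == "__main__":
    img = np.zeros((4, 4))
    img[:, 2:] = 8.0
    g1, u1 = next_pyramid_level(img)
    print(g1)
    print(u1)
```

Averaging `u_next` over the image at each level gives the per-level figure that the text compares across levels to predict texture presence.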


Low altitude aerial

Figure 3. Level 3 images of Fig. 2a
(a) uniformity
(b) uniformity-change

Figure 5. Results from the largest starting element
after each step
(a) starting element at level 4
(b) region grown from (a) at level 3
(c) search area of (b)
(d) after the elimination of the untextured
portions at level 2
ABSTRACT

A complete mathematical treatment is given for
describing the topographic primal sketch of the
underlying grey tone intensity surface of a digital
image. Each picture element is independently
classified and assigned a unique descriptive label,
invariant under monotonically increasing gray tone
transformations, from the set (peak, pit, ridge,
ravine, saddle, flat, and hillside), with hillside
having subcategories (inflection point, slope,
convex hill, concave hill, and saddle hill). The
topographic classification is based on the first
and second directional derivatives of the estimated
image intensity surface. A local, facet model,
two-dimensional, cubic polynomial fit is done to
estimate the image intensity surface. Zero-crossings
of the first directional derivative are
identified as locations of interest in the image.
Results of the technique applied to digital terrain
data and aerial photographs used in the Passive
Image Navigation study are presented.

1. INTRODUCTION

Representing  the  fundamental  structure  of  a 
digital  image  in  a  rich  and  robust  way  is  a  primary 
problem  encountered  in  any  general  robotics 
computer-vision  system  that  has  to  "understand" 
an  image.  The  richness  is  needed  so  that  shading, 
highlighting,  and  shadow  information,  which  are 
usually  present  in  real  manufacturing  assembly  line 
situations,  are  encoded.  Richness  permits 
unambiguous  object  matching  to  be  accomplished. 
Robustness is needed so that the representation is
invariant with respect to monotonically increasing
gray tone transformations. Current representations
involving edges or the primal sketch as described
by Marr (1976; 1980) are impoverished in the sense
that they are insufficient for unambiguous
matching. They also do not have the required
invariance.  Basic  research  is  needed  to  (1) 
define  an  appropriate  representation,  (2)  develop  a 
theory  that  establishes  its  relationship  to 
properties  that  three-dimensional  objects  manifest 
on  the  image,  and  (3)  prove  its  utility  in 
practice.  Until  this  is  done,  computer-vision 
research  must  inevitably  be  more  ad  hoc 
sophistication  than  science. 

The  basis  of  the  topographic  primal  sketch 
consists  of  the  classification  and  grouping  of  the 
underlying  image  intensity  surface  patches 
according  to  the  categories  defined  by  monotonic, 
gray  tone,  invariant  functions  of  directional 
derivatives.  Examples  of  such  categories  are  peak, 
pit,  ridge,  ravine,  saddle,  flat,  and  hillside. 
From  this  initial  classification,  we  can  group 
categories  to  obtain  a  rich,  hierarchical,  and 
structurally  complete  representation  of  the 
fundamental  image  structure.  We  call  this 
representation  the  topographic  primal  sketch. 

Why do we believe that this topographic primal
sketch can be the basis for computer vision? We
believe it because the light-intensity variations
on  an  image  are  caused  by  an  object's  surface 
orientation,  its  reflectance,  and  characteristics 
of  its  lighting  source.  If  any  of  the  three- 
dimensional  intrinsic  surface  characteristics  are 
to  be  detected,  they  will  be  detected  owing  to  the 
nature  of  light-intensity  variations.  Thus,  the 
first  step  is  to  discover  a  robust  representation 
that  can  encode  the  nature  of  these  light-intensity 
variations,  a  representation  that  does  not  change 
with  strength  of  lighting  or  with  gain  settings  on 
the  sensing  camera.  The  topographic  classification 
does  just  that.  The  basic  research  issue  is  to 
define  a  set  of  categories  sufficiently  complete  to 
form groupings and structures that have strong
relationships  to  the  reflectances,  surface 
orientations,  and  surface  positions  of  the  three- 
dimensional  objects  viewed  in  the  image. 
1.1. The Invariance Requirement

A digital image can be obtained with a variety
of sensing-camera gain settings. It can be visually
enhanced by an appropriate adjustment of the
camera's dynamic range. The gain setting or the
enhancing point operator changes the image by some
monotonically increasing function that is not
necessarily linear. For example, nonlinear
enhancing point operators of this type include
histogram normalization and equal probability
quantization.

In visual perception, exactly the same visual
interpretation  and  understanding  of  a  pictured 
scene  occurs  whether  the  camera's  gain  setting  is 
low  or  high  and  whether  the  image  is  enhanced  or 
unenhanced.  The  only  difference  is  that  the 
enhanced  image  has  more  contrast,  is  nicer  to  look 
at,  and  is  understood  more  quickly  by  the  human 
visual  system. 

This  fact  is  important  because  it  suggests 
that  many  of  the  current,  low-level  computer-vision 
techniques,  which  are  based  on  edges,  cannot  ever 
hope  to  have  the  robustness  associated  with  human 
visual perception. They cannot have the
robustness, because they are inherently incapable
of invariance under monotonic transformations. For
example, edges based on zero-crossings of second
derivatives will change in position as the
monotonic gray tone transformation changes because
convexity of a gray tone intensity surface is not
preserved under such transformations. However, the
topographic categories peak, pit, ridge, valley,
saddle, flat, and hillside do have the required
invariance.
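
The contrast drawn above can be illustrated numerically in one dimension. The sketch below is not from the paper: the test profiles (a tanh edge and a Gaussian bump), the cubing transformation, and the function names are all assumptions chosen only to show that a second-derivative zero-crossing moves under a monotonically increasing gray tone transformation while a peak location does not.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 601)  # sample positions, grid step 0.01
h = x[1] - x[0]


def inflection_location(y):
    """Grid positions where the discrete second derivative goes from > 0 to <= 0."""
    d2 = (y[2:] - 2.0 * y[1:-1] + y[:-2]) / h ** 2
    idx = np.where((d2[:-1] > 0) & (d2[1:] <= 0))[0]
    return x[1:-1][idx]


def transform(u):
    """A hypothetical gain curve: monotonically increasing for positive gray tones."""
    return u ** 3


edge = np.tanh(x) + 2.0                # smooth step edge, inflection at x = 0
peak = np.exp(-(x - 0.5) ** 2) + 1.0   # smooth bump, maximum at x = 0.5

edge_before = inflection_location(edge)
edge_after = inflection_location(transform(edge))
peak_before = x[np.argmax(peak)]
peak_after = x[np.argmax(transform(peak))]

if __name__ == "__main__":
    print("zero-crossing before/after:", edge_before, edge_after)
    print("peak before/after:", peak_before, peak_after)
```

For this particular edge the zero-crossing shifts from about 0 to about 0.38 after the transformation, while the peak stays at the same sample.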

1.2.  Background 

Marr (1976) argues that the first level of
visual processing is the computation of a rich
description of gray level changes present in an
image, and that all subsequent computations are
done in terms of this description, which he calls
the primal sketch. Gray level changes are usually
associated with edges, and Marr's primal sketch
has, for each area of gray level change, a
description that includes type, position,
orientation, and fuzziness of edge. Marr (1980)
illustrates that from this information it is
sometimes possible to reconstruct the image to a
reasonable degree. Unfortunately, as mentioned
earlier, edge is not invariant with respect to
monotonic image transformations; besides, it is not
a rich enough structure. Difficulty, for example,
has been experienced in using edges to accomplish
unambiguous stereo matching.

The  topographic  primal  sketch  we  are 
discussing  as  a  basis  for  a  representation  has  the 
required  richness  and  invariance  properties  and  is 
very much in the spirit of Marr's primal sketch and
the thinking behind Ehrich's relational trees
(Ehrich and Foith 1978). Instead of concentrating
on gray level changes as edges as Marr does, or on
one-dimensional extrema as Ehrich and Foith do, we
concentrate  on  all  types  of  two-dimensional  gray 


level  variations.  We  consider  each  area  on  an 
image  to  be  a  spatial  distribution  of  gray  levels 
that  constitutes  a  surface  or  facet  of  gray  tone 
intensities  having  a  specific  surface  shape.  It  is 
likely  that,  if  we  could  describe  the  shape  of  the 
gray  tone  intensity  surface  for  each  pixel,  then  by 
assembling  all  the  shape  fragments  we  could 
reconstruct,  in  a  relative  way,  the  entire  surface 
of  the  image's  gray  tone  intensity  values.  The 
shapes  that  we  already  know  about  that  have  the 
invariance  property  are  peak,  pit,  ridge,  ravine, 
saddle,  flat,  and  hillside,  with  hillside  having 
noninvariant  subcategories  of  slope,  inflection, 
saddle hillside, convex hillside, and concave
hillside.

Knowing  that  a  pixel’s  surface  has  the  shape 
of  a  peak  does  not  tell  us  precisely  where  in  the 
pixel  the  peak  occurs;  nor  does  it  tell  us  the 
height  of  the  peak  or  the  magnitude  of  the  slope 
around  the  peak.  The  topographic  labeling, 
however, does satisfy Marr's (1976) primal sketch
requirement in that it contains a symbolic
description of the gray tone intensity changes.
Furthermore, upon computing and binding to each
topographic  label  numerical  descriptors  such  as 
gradient  magnitude  and  direction,  directions  of  the 
extrema  of  the  second  directional  derivative  along 
with  their  values,  a  reasonable  absolute 
description  of  each  surface  shape  can  be  obtained. 

1.3.  Facet  Model 

The  facet  model  states  that  all  processing  of 
digital  image  data  has  its  final  authoritative 
interpretation  relative  to  what  the  processing  does 
to  the  underlying  gray  tone  intensity  surface.  The 
digital  image’s  pixel  values  are  noisy  sampled 
observations  of  the  underlying  surface.  Thus,  in 
order  to  do  any  processing,  we  at  least  have  to 
estimate  at  each  pixel  position  what  this 
underlying  surface  is.  This  requires  a  model  that 
describes  what  the  general  form  of  the  surface 
would  be  in  the  neighborhood  of  any  pixel  if  there 
were  no  noise.  To  estimate  the  surface  from  the 
neighborhood  around  a  pixel  then  amounts  to 
estimating  the  free  parameters  of  the  general  form. 
It  is  important  to  note  that  if  a  different  general 
form  is  assumed,  then  a  different  estimate  of  the 
surface  is  produced.  Thus  the  assumption  of  a 
particular general form is necessary and has
consequences.

The general form we use is a bivariate cubic.
We assume that the neighborhood around each pixel
is suitably fit by a bivariate cubic (Haralick
1981; 1982). Having estimated this surface around
each  pixel,  the  first  and  second  directional 
derivatives  are  easily  computed  by  analytic  means. 
The  topographic  classification  of  the  surface  facet 
is  based  totally  on  the  first  and  second 
directional  derivatives.  We  classify  each  surface 
point  as  peak,  pit,  ridge,  ravine,  saddle,  flat,  or 
hillside,  with  hillside  being  broken  down  further 
into  the  subcategories  inflection  point,  convex 
hill,  concave  hill,  saddle  hill,  and  slope.  Our 
set of topographic labels is complete in the sense
that every combination of values of the first and
second directional derivative is uniquely assigned
to one of the classes.
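
As an illustration of the facet idea, the sketch below fits a bivariate cubic to a pixel neighborhood and reads the five partial derivatives at the center pixel off the fitted coefficients. It is a sketch under stated assumptions, not the estimation scheme of the paper (which is deferred to Section 4): ordinary least squares, a 5 by 5 window, center-relative row and column coordinates, and the name `fit_cubic_facet` are all choices made here.

```python
import numpy as np


def fit_cubic_facet(window):
    """Least-squares bivariate cubic fit; returns the partials at the window center."""
    w = np.asarray(window, dtype=float)
    half = w.shape[0] // 2  # assumes a square window with odd side length
    r, c = np.mgrid[-half:half + 1, -half:half + 1]
    r = r.ravel().astype(float)
    c = c.ravel().astype(float)

    # The ten monomials r^i c^j with i + j <= 3.
    exps = [(i, j) for i in range(4) for j in range(4 - i)]
    design = np.stack([r ** i * c ** j for i, j in exps], axis=1)
    coef, *_ = np.linalg.lstsq(design, w.ravel(), rcond=None)
    k = dict(zip(exps, coef))

    # Derivatives of the fitted polynomial evaluated at (r, c) = (0, 0).
    return {
        "f_r": k[(1, 0)],
        "f_c": k[(0, 1)],
        "f_rr": 2.0 * k[(2, 0)],
        "f_rc": k[(1, 1)],
        "f_cc": 2.0 * k[(0, 2)],
    }


if __name__ == "__main__":
    rr, cc = np.mgrid[-2:3, -2:3].astype(float)
    patch = 1 + 2 * rr - 3 * cc + 0.5 * rr ** 2 + rr * cc - cc ** 2
    print(fit_cubic_facet(patch))
```

On a noise-free polynomial patch the fit recovers the derivatives exactly; on real pixel data the result depends on the assumed window size and general form, as the text notes.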

1.4.  Previous  Work 

Detection  of  topographic  structures  in  a 
digital  image  is  not  a  new  idea.  There  has  been  a 
wide  variety  of  techniques  to  detect  (a)  peaks  and 
pits  (spots),  (b)  ridges  and  ravines  (lines, 
streaks),  (c)  hillsides  (edges),  and  other  local 
features. Some of this work includes Fischler
(198?), Lee and Fu (1981), Hsu, Mundy, and Beaudet
(1978), Toriwaki and Fukumura (1978), Grender
(1976), Paton (1975), Johnston and Rosenfeld
(1976), Rosenfeld and Kak (1976) and Peuker and
Douglas (1975). Detailed discussion of these
methods is beyond the scope of this paper. For an
excellent  discussion  of  these  works  the  reader  is 
referred  to  Laffey  (1983). 

1.5.  A  Mathematical  Approach 

From  the  investigation  of  previous  work,  one 
can  see  that  a  wide  variety  of  methods  and  labels 
have  been  proposed  to  describe  the  topographic 
structure  in  a  digital  image.  Some  of  the  methods 
require  multiple  passes  through  the  image,  while 
others  may  give  ambiguous  labels  to  a  pixel.  Many 
of  the  methods  are  heuristic  in  nature.  The  Hsu, 
Mundy, and Beaudet (1978) approach is the most
similar  to  the  one  discussed  here. 


Our classification approach is based on the
estimation of the first- and second-order
directional derivatives. Thus, we regard the
digital-picture function as a sampling of the
underlying function f, where some kind of random
noise is added to the true function values. To
estimate the first and second partials, we must
assume some kind of parametric form for the
underlying function f. The classifier must use the
sampled brightness values of the digital-picture
function to estimate the parameters and then make
decisions regarding the locations of relative
extrema of partial derivatives based on the
estimated values of the parameters.

In Section 2, we will discuss the mathematical
properties of the topographic structures in terms
of the directional derivatives in the continuous
surface domain. Because a digital image is a
sampled surface and each pixel has an area
associated with it, characteristic topographic
structures may occur anywhere within a pixel's
area. Thus, the implementation of the mathematical
topographic definitions is not entirely trivial.

In Section 3 we will discuss the
implementation of the classification scheme on a
digital image. To identify categories that are
local one-dimensional extrema, such as peak, pit,
ridge, ravine, and saddle, we search inside the
pixel's area for a zero-crossing of the first
directional derivative. The directions in which we
seek the zero-crossing are along the lines of
extreme curvature.

In Section 4, we will discuss the local cubic
estimation scheme. In Section 5, we will summarize
the algorithm for topographic classification using
the local facet model. In Section 6, we will show
the results of the classifier on digital terrain
data and aerial photographs.

2. THE MATHEMATICAL CLASSIFICATION OF TOPOGRAPHIC
STRUCTURES

In this section, we formulate our notion of
topographic structures on continuous surfaces and
show their invariance under monotonically
increasing gray tone transformations. In order to
understand the mathematical properties used to
define our topographic structures, one must
understand the idea of the directional derivative
discussed in most advanced calculus books. For
completeness, we first give the definition of the
directional derivative, then the definitions of the
topographic labels. Finally, we show the
invariance under monotonically increasing gray tone
transformations.

2.1. The Directional Derivative

In two dimensions, the rate of change of a
function f depends on direction. We denote the
directional derivative of f at the point (r,c) in
the direction $\beta$ by $f'_\beta(r,c)$. It is defined as

$$f'_\beta(r,c) = \lim_{h \to 0} \frac{f(r + h\sin\beta,\ c + h\cos\beta) - f(r,c)}{h}.$$

The direction angle $\beta$ is the clockwise angle from
the column axis. It follows directly from this
definition that

$$f'_\beta(r,c) = \frac{\partial f}{\partial r}(r,c)\,\sin\beta + \frac{\partial f}{\partial c}(r,c)\,\cos\beta.$$

We denote the second derivative of f at the point
(r,c) in the direction $\beta$ by $f''_\beta(r,c)$, and it
follows that

$$f''_\beta = \frac{\partial^2 f}{\partial r^2}\sin^2\beta + 2\,\frac{\partial^2 f}{\partial r\,\partial c}\sin\beta\cos\beta + \frac{\partial^2 f}{\partial c^2}\cos^2\beta.$$

The gradient of f is a vector whose magnitude,

$$\sqrt{\left(\frac{\partial f}{\partial r}\right)^2 + \left(\frac{\partial f}{\partial c}\right)^2},$$

at a given point (r,c) is the maximum rate of
change of f at that point, and whose direction,

$$\tan^{-1}\left(\frac{\partial f/\partial r}{\partial f/\partial c}\right),$$

is the direction in which the surface has the
greatest rate of change.
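
The formulas of this subsection translate directly into code. The following minimal sketch assumes the five partial derivatives at a point are already available; the function and argument names are illustrative, and the angle is measured clockwise from the column axis as in the text.

```python
import math


def directional_derivatives(f_r, f_c, f_rr, f_rc, f_cc, beta):
    """First and second directional derivatives in the direction beta (radians)."""
    s, c = math.sin(beta), math.cos(beta)
    first = f_r * s + f_c * c
    second = f_rr * s * s + 2.0 * f_rc * s * c + f_cc * c * c
    return first, second


def gradient_magnitude_direction(f_r, f_c):
    """Gradient magnitude and the angle beta in which f increases fastest."""
    # atan2 returns the angle whose tangent is f_r / f_c, in the correct quadrant.
    return math.hypot(f_r, f_c), math.atan2(f_r, f_c)


if __name__ == "__main__":
    # Example: f(r, c) = r^2 + 3c at (r, c) = (1, 0).
    f_r, f_c, f_rr, f_rc, f_cc = 2.0, 3.0, 2.0, 0.0, 0.0
    mag, beta = gradient_magnitude_direction(f_r, f_c)
    print(mag, directional_derivatives(f_r, f_c, f_rr, f_rc, f_cc, beta))
```

Evaluating the first directional derivative in the gradient direction returns the gradient magnitude, which matches the statement that the gradient gives the maximum rate of change.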
We will use the following notation to describe
the mathematical properties of our various
topographic categories for continuous surfaces. Let

$\nabla f$ = gradient vector of a function f;

$\|\nabla f\|$ = gradient magnitude;

$\omega^{(1)}$ = unit vector in direction in which
second directional derivative has
greatest magnitude;

$\omega^{(2)}$ = unit vector orthogonal to $\omega^{(1)}$;

$\lambda_1$ = value of second directional derivative
in the direction of $\omega^{(1)}$;

$\lambda_2$ = value of second directional derivative
in the direction of $\omega^{(2)}$;

$\nabla f \cdot \omega^{(1)}$ = value of first directional derivative
in the direction of $\omega^{(1)}$; and

$\nabla f \cdot \omega^{(2)}$ = value of first directional derivative
in the direction of $\omega^{(2)}$.

Without loss of generality, we assume $|\lambda_1| \ge |\lambda_2|$.


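Each type of topographic structure in our
classification scheme is defined in terms of the
above quantities. In order to calculate these
values, the first- and second-order partials with
respect to r and c need to be approximated. These
five partials are as follows:

$$\frac{\partial f}{\partial r},\quad \frac{\partial f}{\partial c},\quad \frac{\partial^2 f}{\partial r^2},\quad \frac{\partial^2 f}{\partial c^2},\quad \frac{\partial^2 f}{\partial r\,\partial c}.$$

The gradient vector is simply $\left(\frac{\partial f}{\partial r}, \frac{\partial f}{\partial c}\right)$.
The second directional derivatives may be
calculated by forming the Hessian, where the
Hessian is a 2*2 matrix defined as

$$H = \begin{pmatrix} \dfrac{\partial^2 f}{\partial r^2} & \dfrac{\partial^2 f}{\partial r\,\partial c} \\[2ex] \dfrac{\partial^2 f}{\partial c\,\partial r} & \dfrac{\partial^2 f}{\partial c^2} \end{pmatrix}.$$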


Hessian matrices are used extensively in
nonlinear programming. Only three parameters are
required to determine the Hessian matrix $H$, since
the order of differentiation of the cross partials
may be interchanged. That is,

$$\frac{\partial^2 f}{\partial r\,\partial c} = \frac{\partial^2 f}{\partial c\,\partial r}.$$


The  eigenvalues  of  the  Hessian  are  the  values 
of  the  extrema  of  the  second  directional 
derivative,  and  their  associated  eigenvectors  are 
the  directions  in  which  the  second  directional 
derivative  is  extremized.  This  can  easily  be  seen 
by  rewriting  f”  as  the  quadratic  form 

f„  =  (  sin(i  cosp  )  *  H  *  |  sinp  |. 

I  cosp  I 


Thus  , 


Hu 


(1) 


_  i  (1)  ,  „  (2) 

=  and  Hu  =  k^w 


(2) 


Furthermore,  the  two  directions  represented  by  the 
eigenvectors  are  orthogonal  to  one  another.  Since 
$H$ is a 2x2 symmetric matrix, calculation of the
eigenvalues and eigenvectors can be done
efficiently and accurately using the method of
Rutishauser (1971). We may obtain the values of the
first directional derivative in the direction of
either extremum of the second directional derivative
by simply taking the dot product of the gradient
with the appropriate eigenvector:

$$\nabla f \cdot \omega^{(1)}, \qquad \nabla f \cdot \omega^{(2)}.$$
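The quantities defined so far can be computed from the five partials in a few lines. The following is a minimal sketch, not the authors' implementation: it uses the closed-form eigen-decomposition of a symmetric 2x2 matrix in place of the Rutishauser method cited above, and the function names `hessian_eigen` and `directional_derivatives` are hypothetical.

```python
import math


def hessian_eigen(f_rr, f_rc, f_cc):
    """Eigen-decomposition of the symmetric Hessian [[f_rr, f_rc], [f_rc, f_cc]].

    Returns (lam1, w1, lam2, w2) with |lam1| >= |lam2|; w1 and w2 are
    orthogonal unit vectors expressed in (r, c) coordinates.
    """
    mean = 0.5 * (f_rr + f_cc)
    diff = 0.5 * (f_rr - f_cc)
    root = math.hypot(diff, f_rc)
    # Principal-axis angle of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(f_rc, diff)
    va = (math.cos(theta), math.sin(theta))    # eigenvector for mean + root
    vb = (-math.sin(theta), math.cos(theta))   # eigenvector for mean - root
    pairs = [(mean + root, va), (mean - root, vb)]
    # Order so that the eigenvalue of greatest magnitude comes first.
    pairs.sort(key=lambda pair: abs(pair[0]), reverse=True)
    (lam1, w1), (lam2, w2) = pairs
    return lam1, w1, lam2, w2


def directional_derivatives(f_r, f_c, f_rr, f_rc, f_cc):
    """Return (lam1, lam2, d1, d2), where d_i is the dot product of the
    gradient (f_r, f_c) with the i-th eigenvector of the Hessian."""
    lam1, w1, lam2, w2 = hessian_eigen(f_rr, f_rc, f_cc)
    d1 = f_r * w1[0] + f_c * w1[1]
    d2 = f_r * w2[0] + f_c * w2[1]
    return lam1, lam2, d1, d2
```

The sign of each eigenvector is arbitrary, so only the magnitude (or the zero/nonzero status) of `d1` and `d2` is meaningful, which is all the classification conditions require.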


There is a direct relationship between the
eigenvalues $\lambda_1$ and $\lambda_2$ and curvature in the
directions $\omega^{(1)}$ and $\omega^{(2)}$: when the first
directional derivative $\nabla f \cdot \omega^{(i)} = 0$, then
$\lambda_i / (1 + \nabla f \cdot \nabla f)^{1/2}$ is the curvature in the
direction $\omega^{(i)}$, $i = 1$ or $2$. For further
discussion  on  the  relationship  of  surface  curvature 
to  directional  derivative,  see  Laffey  (1983). 

Having  the  gradient  magnitude  and  direction 
and  the  eigenvalues  and  eigenvectors  of  the 
Hessian,  we  can  describe  the  topographic 
classification scheme.

2.2.1.  Peak 


A peak (knob) occurs where there is a local
maximum in all directions. In other words, we are
on  a  peak  if,  no  matter  what  direction  we  look  in, 
we  see  no  point  that  is  as  high  as  the  one  we  are 
on.  The  curvature  is  downward  in  all  directions. 
At  a  peak  the  gradient  is  zero,  and  the  second 
directional  derivative  is  negative  in  all 
directions.  To  test  whether  the  second  directional 
derivative  is  negative  in  all  directions,  we  just 
have  to  examine  the  value  of  the  second  directional 
derivative  in  the  directions  that  make  it  smallest 
and  largest.  A  point  is  therefore  classified  as  a 
peak  if  it  satisfies  the  following  conditions: 

$\|\nabla f\| = 0$, $\lambda_1 < 0$, $\lambda_2 < 0$.

2.2.2.  Pit 


A pit (sink, bowl) is identical to a peak
except that it is a local minimum in all directions
rather than a local maximum. At a pit the gradient
is zero, and the second directional derivative is
positive in all directions. A point is classified
as a pit if it satisfies the following conditions:

$\|\nabla f\| = 0$, $\lambda_1 > 0$, $\lambda_2 > 0$.

2.2.3. Ridge

A ridge occurs on a ridge-line, a curve
consisting of a series of ridge points. As we walk
along the ridge-line, the points to the right and
left of us are lower than the one we are on.
Furthermore, the ridge-line may be flat, slope
upward, slope downward, curve upward, or curve
downward. A ridge occurs where there is a local
maximum in one direction. Therefore, it must have
negative second directional derivative in the
direction across the ridge and also a zero first
directional derivative in that same direction. The
direction in which the local maximum occurs may
correspond to either of the directions in which the
curvature is "extremized", since the ridge itself
may be curved. For nonflat ridges, this leads to
the first two cases below for ridge
characterization. If the ridge is flat, then the
ridge-line is horizontal and the gradient is zero
along it. This corresponds to the third case. The
defining characteristic is that the second
directional derivative in the direction of the
ridge-line is zero, while the second directional
derivative across the ridge-line is negative. A
point is therefore classified as a ridge if it
satisfies any one of the following three sets of
conditions:

$\|\nabla f\| \ne 0$, $\lambda_1 < 0$, $\nabla f \cdot \omega^{(1)} = 0$

or

$\|\nabla f\| \ne 0$, $\lambda_2 < 0$, $\nabla f \cdot \omega^{(2)} = 0$

or

$\|\nabla f\| = 0$, $\lambda_1 < 0$, $\lambda_2 = 0$.

A geometric way of thinking about the
definition for ridge is to realize that the
condition $\nabla f \cdot \omega^{(1)} = 0$ means that the gradient
direction (which is defined for nonzero gradients)
is orthogonal to the direction $\omega^{(1)}$ of extremized
curvature.

2.2.4. Ravine

A ravine (valley) is identical to a ridge
except that it is a local minimum rather than
maximum in one direction. As we walk along the
ravine-line, the points to the right and left of us
are higher than the one we are on (see Fig. 2). A
point is classified as a ravine if it satisfies any
one of the following three sets of conditions:

$\|\nabla f\| \ne 0$, $\lambda_1 > 0$, $\nabla f \cdot \omega^{(1)} = 0$

or

$\|\nabla f\| \ne 0$, $\lambda_2 > 0$, $\nabla f \cdot \omega^{(2)} = 0$

or

$\|\nabla f\| = 0$, $\lambda_1 > 0$, $\lambda_2 = 0$.

2.2.5. Saddle


A saddle occurs where there is a local maximum
in one direction and a local minimum in a
perpendicular direction. A saddle must therefore
have positive curvature in one direction and
negative curvature in a perpendicular direction.
At a saddle, the gradient magnitude must be zero
and the extrema of the second directional
derivative must have opposite signs. A point is
classified as a saddle if it satisfies the
following conditions:

$\|\nabla f\| = 0$, $\lambda_1 \lambda_2 < 0$.

2.2.6.  Flat 

A flat (plain) is a simple, horizontal
surface, as illustrated in Fig. 3. It, therefore,
must have zero gradient and no curvature. A point
is classified as a flat if it satisfies the
following conditions:

$\|\nabla f\| = 0$, $\lambda_1 = 0$, $\lambda_2 = 0$.

Given  that  the  above  conditions  are  true,  a 
flat  may  be  further  classified  as  a  foot  or 
shoulder.  A  foot  occurs  at  that  point  where  the 
flat  just  begins  to  turn  up  into  a  hill.  At  this 
point,  the  third  directional  derivative  in  the 
direction  toward  the  hill  will  be  nonzero,  and  the 
surface increases in this direction. The shoulder
is  an  analogous  case  and  occurs  where  the  flat  is 
ending  and  turning  down  into  a  hill.  At  this  point, 
the  maximum  magnitude  of  the  third  directional 
derivative  is  nonzero,  and  the  surface  decreases  in 
the  direction  toward  the  hill.  If  the  third 
directional  derivative  is  zero  in  all  directions, 
then  we  are  on  a  flat,  not  near  a  hill.  Thus  a  flat 
may  be  further  qualified  as  being  a  foot  or 
shoulder,  or  not  qualified  at  all. 

2.2.7.  Hillside 

A  hillside  point  is  anything  not  covered  by 
the  previous  categories.  It  has  a  nonzero  gradient 
and  no  strict  extrema  in  the  directions  of  maximum 
and  minimum  second  directional  derivative.  If  the 
hill  is  simply  a  tilted  flat  (i.e.,  has  constant 
gradient), we call it a slope. If its curvature is
positive (upward), we call it a convex hill. If
its curvature is negative (downward), we call it a
concave hill. If the curvature is up in one
direction and down in a perpendicular direction, we
call it a saddle hill.

A point on a hillside is an inflection point
if it has a zero-crossing of the second directional
derivative taken in the direction of the gradient.
The inflection-point class is the same as the step
edge defined by Haralick (1982), who classifies a
pixel as a step edge if there is some point in the
pixel's area having a zero-crossing of the second
directional derivative taken in the direction of
the gradient.

To determine whether a point is a hillside, we
just take the complement of the disjunction of the
conditions given for all the previous classes.
Thus if there is no curvature, then the gradient
must be nonzero. If there is curvature, then the
point must not be a relative extremum. Therefore,
a point is classified as a hillside if all three
sets of the following conditions are true
($\rightarrow$ represents the operation of logical implication):

$\lambda_1 = \lambda_2 = 0 \;\rightarrow\; \|\nabla f\| \ne 0$,

and

$\lambda_1 \ne 0 \;\rightarrow\; \nabla f \cdot \omega^{(1)} \ne 0$,

and

$\lambda_2 \ne 0 \;\rightarrow\; \nabla f \cdot \omega^{(2)} \ne 0$.

Rewritten as a disjunction of clauses rather
than a conjunction of clauses, a point is
classified as a hillside if any one of the
following four sets of conditions is true:

$\nabla f \cdot \omega^{(1)} \ne 0$, $\nabla f \cdot \omega^{(2)} \ne 0$

or

$\nabla f \cdot \omega^{(1)} \ne 0$, $\lambda_2 = 0$

or

$\nabla f \cdot \omega^{(2)} \ne 0$, $\lambda_1 = 0$

or

$\|\nabla f\| \ne 0$, $\lambda_1 = 0$, $\lambda_2 = 0$.

We can differentiate between different classes of
hillsides by the values of the second directional
derivative. The distinction can be made as follows:

SLOPE if $\lambda_1 = \lambda_2 = 0$

CONVEX if $\lambda_1 \ge \lambda_2 \ge 0$, $\lambda_1 \ne 0$

CONCAVE if $\lambda_1 \le \lambda_2 \le 0$, $\lambda_1 \ne 0$

SADDLE HILL if $\lambda_1 \lambda_2 < 0$

A slope, convex, concave, or saddle hill is
classified as an inflection point if there is a
zero-crossing of the second directional derivative
in the direction of maximum first directional
derivative (i.e., the gradient).

(Note: Special attention is required for the
degenerate case $\lambda_1 = \lambda_2 \ne 0$, where $\omega^{(1)}$ and
$\omega^{(2)}$ can be any two orthogonal directions. In
this case, there always exists an extreme direction
$\omega$ which is orthogonal to $\nabla f$, and thus the first
directional derivative $\nabla f \cdot \omega$ is always zero in an
extreme direction. To avoid spurious zero
directional derivatives, we choose $\omega^{(1)}$ and
$\omega^{(2)}$ such that $\nabla f \cdot \omega^{(1)} \ne 0$ and $\nabla f \cdot \omega^{(2)} \ne 0$,
unless the gradient is zero.)

2.2.8. Summary of the Topographic Categories

A summary of the mathematical properties of
our topographic structures on continuous surfaces
can be found in Table 1. The table exhaustively
defines the topographic classes by their gradient
magnitude, second directional derivative extrema
values, and the first directional derivatives taken
in the directions which extremize second
directional derivatives. Each entry in the table
is either 0, +, -, or *. The 0 means not
significantly different from zero; + means
significantly different from zero on the positive
side; - means significantly different from zero on
the negative side; and * means it does not
matter. The label "Cannot Occur" means that it is
impossible for the gradient to be nonzero and the
first directional derivative to be zero in two
orthogonal directions.

From the table, one can see that our
classification scheme is complete. All possible
combinations of first and second directional
derivatives have a corresponding entry in the
table. Each topographic category has a set of
mathematical properties that uniquely determines
it.

Table 1. Mathematical Properties of Topographic
Structures
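The point-wise conditions of Sections 2.2.1 through 2.2.7 translate directly into a small decision procedure. The sketch below is one possible encoding rather than the authors' code: the tolerance `eps` that decides what counts as "not significantly different from zero" and the function names are assumptions made for illustration.

```python
def _sign(x, eps):
    """-1, 0, or +1, treating |x| <= eps as zero."""
    if abs(x) <= eps:
        return 0
    return 1 if x > 0 else -1


def classify(grad_mag, lam1, lam2, d1, d2, eps=1e-6):
    """Topographic label of a point.

    grad_mag: gradient magnitude; lam1, lam2: Hessian eigenvalues;
    d1, d2: first directional derivatives along the two eigenvectors.
    """
    g = _sign(grad_mag, eps)
    s1, s2 = _sign(lam1, eps), _sign(lam2, eps)
    zero1, zero2 = _sign(d1, eps) == 0, _sign(d2, eps) == 0

    if g == 0:
        if s1 * s2 < 0:
            return "saddle"
        if s1 < 0 or s2 < 0:
            return "peak" if (s1 < 0 and s2 < 0) else "ridge"
        if s1 > 0 or s2 > 0:
            return "pit" if (s1 > 0 and s2 > 0) else "ravine"
        return "flat"
    # Nonzero gradient: ridge or ravine needs a zero first directional
    # derivative along an eigenvector whose eigenvalue has the right sign.
    if (s1 < 0 and zero1) or (s2 < 0 and zero2):
        return "ridge"
    if (s1 > 0 and zero1) or (s2 > 0 and zero2):
        return "ravine"
    return "hillside"


def hillside_subclass(lam1, lam2, eps=1e-6):
    """Slope, convex hill, concave hill, or saddle hill."""
    s1, s2 = _sign(lam1, eps), _sign(lam2, eps)
    if s1 == 0 and s2 == 0:
        return "slope"
    if s1 * s2 < 0:
        return "saddle hill"
    return "convex hill" if (s1 >= 0 and s2 >= 0) else "concave hill"
```

For example, `classify(1.0, -1.0, 0.0, 0.0, 1.0)` returns `"ridge"`: the gradient is nonzero, the dominant eigenvalue is negative, and the first directional derivative vanishes along its eigenvector.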


2.3. The Invariance of the Topographic Categories

For a proof of the invariance of the
topographic categories (peak, pit, ridge, ravine,
saddle, flat, and hillside), see Haralick, Watson,
and Laffey (1983), or Laffey (1983).

2.4  Ridge  and  Ravine  Continuums 

The definitions for ridge and ravine can lead
to some possibly unexpected results. For example,
all points on a right circular cone, except the
vertex, will be labeled ridge. Whether one wishes
to call these points ridge points or something else
is a matter of taste. These points are classified
as ridge points because as one walks up the cone
toward the vertex the points to the left and right
are lower than the one on which one stands. The continuum
of ridges may or may not be acceptable depending
upon one's viewpoint. Further work by Haralick
(forthcoming) has partially solved this problem.
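The cone example can be checked against the ridge conditions of Section 2.2.3. The derivation below is a sketch under one assumption not stated in the text: the cone is written as $f(r,c) = -\sqrt{r^2 + c^2}$, with its vertex at the origin and unit slope.

```latex
\begin{align*}
\rho &= \sqrt{r^2 + c^2}, \qquad
n = \frac{1}{\rho}\begin{pmatrix} r \\ c \end{pmatrix}, \qquad
t = \frac{1}{\rho}\begin{pmatrix} -c \\ r \end{pmatrix}, \qquad \rho > 0, \\
\nabla f &= -n, \qquad \|\nabla f\| = 1 \ne 0, \\
H &= -\frac{1}{\rho}\left(I - n n^{T}\right) = -\frac{1}{\rho}\, t\, t^{T}, \\
H t &= -\frac{1}{\rho}\, t, \qquad H n = 0, \\
\lambda_1 &= -\frac{1}{\rho} < 0, \quad \omega^{(1)} = t, \qquad
\lambda_2 = 0, \quad \omega^{(2)} = n, \\
\nabla f \cdot \omega^{(1)} &= -\, n \cdot t = 0 .
\end{align*}
```

Every point other than the vertex therefore satisfies $\|\nabla f\| \ne 0$, $\lambda_1 < 0$, $\nabla f \cdot \omega^{(1)} = 0$, which is the first ridge condition. The extremized direction $\omega^{(1)}$ is tangent to the circular cross-section, matching the observation that points to the left and right are lower.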

3.0 THE TOPOGRAPHIC CLASSIFICATION ALGORITHM

The definitions of Section 2 cannot be used
directly since there is a problem of where in a
pixel's area to apply the classification. If the
classification were only applied to the point at
the center of each pixel, then a pixel having a
peak near one of its corners, for example, would
get classified as a concave hill rather than as a
peak. The problem is that the topographic
classification we are interested in must be a
sampling of the actual topographic surface classes.
Most likely, the interesting categories of peak,
pit, ridge, ravine, and saddle will never occur
precisely at a pixel's center, and if they do occur
in a pixel's area, then the pixel must carry that
label rather than the class label of the pixel's
center point. Thus one problem we must solve is to
determine the dominant label for a pixel given the
topographic class label of every point in the
pixel. The next problem we must solve is to
determine, in effect, the set of all topographic
classes occurring within a pixel's area without
having to do the impossible brute-force
computation.

For the purpose of solving these problems, we
divide the set of topographic labels into two
subsets: (1) those that indicate that a strict,
local, one-dimensional extremum has occurred (peak,
pit, ridge, ravine, and saddle) and (2) those that
do not indicate that a strict, local,
one-dimensional extremum has occurred (flat and
hillside). By one-dimensional, we mean along a
line (in a particular direction). A strict, local,
one-dimensional extremum can be located by finding
those points within a pixel's area where a
zero-crossing of the first directional derivative
occurs.

So that we do not search the pixel's entire
area for the zero-crossing, we only search in the
directions of extreme second directional
derivative, $\omega^{(1)}$ and $\omega^{(2)}$. Since these
directions are well aligned with curvature
properties, the chance of overlooking an important
topographic structure is minimized, and, more
importantly, the computational cost is small.

When $\lambda_1 = \lambda_2 \ne 0$, the directions
$\omega^{(1)}$ and $\omega^{(2)}$ are
not uniquely defined. We handle this
case by searching for a zero-crossing in the
direction given by $H^{-1} \nabla f$. This is the Newton
direction, and it points directly toward the
extremum of a quadratic surface.

For inflection-point location (first
derivative extremum), we search along the gradient
direction for a zero-crossing of second directional
derivative. For one-dimensional extrema, there are
four cases to consider: (1) no zero-crossing, (2)
one zero-crossing, (3) two zero-crossings, and (4)
more than two zero-crossings. The next four
sections discuss these cases.

3.1. Case One: No Zero-Crossing

If no zero-crossing is found along either of
the two extreme directions within the pixel's area,
then the pixel cannot be a local extremum and
therefore must be assigned a label from the set
(flat or hillside). If the gradient is zero, we
have a flat. If it is nonzero, we have a hillside.
If the pixel is a hillside, we classify it further
into (inflection point, slope, convex hill, concave
hill, or saddle hill). If there is a zero-crossing
of the second directional derivative in the
direction of the gradient within the pixel's area,
the pixel is classified as an inflection point. If
no such zero-crossing occurs, the label assigned to
the pixel is based on the gradient magnitude and
Hessian eigenvalues calculated at the center of the
pixel, local coordinates (0,0), as in Table 2.

3.2. Case Two: One Zero-Crossing

If a zero-crossing of the first directional
derivative is found within the pixel's area, then
the pixel is a strict, local, one-dimensional
extremum and must be assigned a label from the set
(peak, pit, ridge, ravine, or saddle). At the
location of the zero-crossing, the Hessian and
gradient are recomputed, and if the gradient
magnitude at the zero-crossing is zero, Table 3 is
used.

If the gradient magnitude is nonzero, then the
choice is either ridge or ravine. If the second
directional derivative in the direction of the
zero-crossing is negative, we have a ridge. If it
is positive, we have a ravine. If it is zero, we
compare the function value at the center of the
pixel, f(0,0), with the function value at the
zero-crossing, f(r,c). If f(r,c) is greater than
f(0,0), we call it a ridge, otherwise we call it a
ravine.

3.3. Case Three: Two Zero-Crossings


If we have two zero-crossings of the first
directional derivative, one in each direction of
extreme curvature, then the Hessian and gradient
must be recomputed at each zero-crossing. Using
the procedure described in Section 3.2, we assign a
label to each zero-crossing. We call these labels
LABEL1 and LABEL2. The final classification given
the pixel is based on these two labels and is given
in Table 4.

If both labels are identical, the pixel is
given that label. In the case of both labels being
ridge, the pixel may actually be a peak, but
experiments have shown that this case is rare. An
analogous argument can be made for both labels
being ravine. If one label is ridge and the other
ravine, this indicates we are at or very close to a
saddle point, and thus the pixel is classified as a
saddle. If one label is peak and the other ridge,
we choose the category giving us the "most
information," which in this case is peak. The
peak is a local maximum in all directions, while
the ridge is a local maximum in only one direction.
Thus, peak conveys more information about the image
surface. An analogous argument can be made if the
labels are pit and ravine. Similarly, a saddle
gives us more information than a ridge or valley.
Thus, a pixel is assigned saddle if its
zero-crossings have been labeled ridge and saddle or
ravine and saddle.

It is apparent from Table 4 that not all
possible label combinations are accounted for. Some
combinations, such as peak and pit, are omitted
because of the assumption that the underlying
surface is smooth and sampled frequently enough
that a peak and pit will not both occur within the
same pixel's area. If such a case occurs, our
convention is to choose arbitrarily one of LABEL1
or LABEL2 as the resulting label for the pixel.
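The combination rules stated in the preceding paragraphs can be written as a short lookup. This is a sketch of the rules as described in the prose only; it does not reproduce Table 4, and the function name `combine_labels` and the choice of LABEL1 as the "arbitrary" fallback are assumptions.

```python
def combine_labels(label1, label2):
    """Final pixel label from the labels of the two zero-crossings."""
    if label1 == label2:
        return label1
    pair = {label1, label2}
    if pair == {"ridge", "ravine"}:
        return "saddle"            # at or very close to a saddle point
    if pair == {"peak", "ridge"}:
        return "peak"              # peak conveys more information than ridge
    if pair == {"pit", "ravine"}:
        return "pit"               # analogous argument for pit and ravine
    if pair in ({"saddle", "ridge"}, {"saddle", "ravine"}):
        return "saddle"
    # Combinations the text does not cover (e.g. peak and pit): the stated
    # convention is an arbitrary choice; LABEL1 is used here.
    return label1
```

Because the rules depend only on the unordered pair of labels, a set comparison keeps the function symmetric in its arguments for every covered case.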


Table 2. Pixel Label Calculation for Case One:
No Zero-Crossing

Table 3. Pixel Label Calculation for Case Two:
One Zero-Crossing

Table 4. Final Pixel Classification, Case Three:
Two Zero-Crossings


3.4.  Case  Four:  More  than  Two  Zero-Crossings 

If more than two zero-crossings occur within a
pixel's area, then in at least one of the extreme
directions there are two zero-crossings. If this
happens, we choose the zero-crossing closest to the
pixel's center and ignore the other. If we ignore
the farther zero-crossings, then this case is
identical to Case Three. This situation has yet to
occur in our experiments.
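The directional search used in all four cases can be sketched as a one-dimensional root search along a line through the pixel center. The code below is an illustrative sketch, not the authors' procedure: it assumes the pixel is the unit square centered at local coordinates (0,0), samples the first directional derivative along the line, refines each sign change by bisection, and returns the crossing closest to the center. The names `zero_crossing_along`, `half_width`, and `samples` are hypothetical.

```python
def zero_crossing_along(grad, w, half_width=0.5, samples=64):
    """Signed distance rho (closest to 0) at which the first directional
    derivative grad(rho * w) . w changes sign inside the pixel, or None.

    grad: callable (r, c) -> (f_r, f_c); w: unit vector (w_r, w_c).
    """
    def g(rho):
        f_r, f_c = grad(rho * w[0], rho * w[1])
        return f_r * w[0] + f_c * w[1]

    # Largest |rho| that keeps the point (rho*w_r, rho*w_c) inside the square.
    rho_max = half_width / max(abs(w[0]), abs(w[1]))
    step = 2.0 * rho_max / samples
    roots = []
    lo, g_lo = -rho_max, g(-rho_max)
    for i in range(1, samples + 1):
        hi = -rho_max + i * step
        g_hi = g(hi)
        if g_lo == 0.0:
            roots.append(lo)
        elif g_lo * g_hi < 0.0:
            # Bisection on the bracketing interval [a, b].
            a, b, g_a = lo, hi, g_lo
            for _ in range(60):
                mid = 0.5 * (a + b)
                g_mid = g(mid)
                if g_a * g_mid <= 0.0:
                    b = mid
                else:
                    a, g_a = mid, g_mid
            roots.append(0.5 * (a + b))
        lo, g_lo = hi, g_hi
    if g_lo == 0.0:
        roots.append(lo)
    return min(roots, key=abs) if roots else None
```

With a cubic fit the directional derivative along a line is a quadratic in `rho`, so a closed-form root would also work; sampling plus bisection is used here only because it makes no assumption about the surface model.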

4.0  SURFACE  ESTIMATION 

In this section we discuss the estimation of
the parameters required by the topographic
classification scheme of Section 2 using the local
cubic facet model (Haralick 1981). It is important
to note that the classification scheme of Section 2
and the algorithm of Section 3 are independent of
the method used to estimate the first- and
second-order partials of the underlying digital
image-intensity surface at each sampled point. Results
from using basis functions other than the bicubic
polynomial are presented in Laffey (1983). In
these experiments the cubic model performed best.

4.1. Local Cubic Facet Model

In order to estimate the required partial
derivatives, we perform a least-squares fit with a
two-dimensional surface, f, to a neighborhood of
each pixel. It is required that the function f be
continuous and have continuous first- and
second-order partial derivatives with respect to r and c
in a neighborhood around each pixel in the rc
plane.

We choose f to be a cubic polynomial in r and
c expressed as a combination of discrete orthogonal
polynomials. The function f is the best discrete
least-squares polynomial approximation to the image
data in each pixel's neighborhood. More details
can be found in Haralick's paper (1981), in which
each coefficient of the cubic polynomial is
evaluated as a linear combination of the pixels in
the fitting neighborhood.
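A least-squares cubic fit over one neighborhood can be sketched without the orthogonal-polynomial machinery. The code below is an assumption-laden illustration, not the method of Haralick (1981): it fits the ten monomials of a bivariate cubic directly by solving the normal equations, whereas the paper obtains each coefficient as a linear combination of the neighborhood pixels (convolution masks). The monomial ordering and all function names are hypothetical.

```python
def monomials(r, c):
    """Assumed ordering: 1, r, c, r^2, rc, c^2, r^3, r^2 c, r c^2, c^3."""
    return [1.0, r, c, r * r, r * c, c * c,
            r ** 3, r * r * c, r * c * c, c ** 3]


def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting (a is n x n)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for i in range(n):
            if i != col and m[i][col] != 0.0:
                factor = m[i][col]
                m[i] = [vi - factor * vc for vi, vc in zip(m[i], m[col])]
    return [m[i][n] for i in range(n)]


def fit_cubic_facet(window):
    """Least-squares cubic fit to a square, odd-sized window (list of rows).

    Local coordinates place (0, 0) at the window center. Returns the ten
    coefficients in the ordering used by monomials(). The window must be
    at least 5x5 so that the normal equations are nonsingular.
    """
    half = len(window) // 2
    ata = [[0.0] * 10 for _ in range(10)]
    atb = [0.0] * 10
    for i, row in enumerate(window):
        for j, value in enumerate(row):
            phi = monomials(float(i - half), float(j - half))
            for p in range(10):
                atb[p] += phi[p] * value
                for q in range(10):
                    ata[p][q] += phi[p] * phi[q]
    return solve(ata, atb)


def partials_at_center(k):
    """First- and second-order partials at (0, 0): only the coefficients
    of the monomials of degree one and two contribute."""
    return {"f_r": k[1], "f_c": k[2],
            "f_rr": 2.0 * k[3], "f_rc": k[4], "f_cc": 2.0 * k[5]}
```

Solving the normal equations per window is far slower than convolving fixed masks over the image, but it makes the least-squares definition explicit and is convenient for checking a mask-based implementation.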

To  express  the  procedure  precisely  and  without 
reference  to  a  particular  set  of  polynomials  tied 
to  neighborhood  size,  we  will  canonically  write  the 
fitted  bicubic  surface  for  each  fitting 
neighborhood as

$$f(r,c) = k_1 + k_2 r + k_3 c + k_4 r^2 + k_5 rc + k_6 c^2
+ k_7 r^3 + k_8 r^2 c + k_9 r c^2 + k_{10} c^3.$$

The first- and second-order partials of f follow by differentiating
this polynomial term by term.

It is easy to see that if the above quantities
are evaluated at the center of the pixel where
local coordinates (r,c) = (0,0), only the constant
terms will be of significance. If the partials
need to be evaluated at an arbitrary point in a
pixel's area, then a linear or quadratic polynomial
value must be computed.

5. SUMMARY OF THE TOPOGRAPHIC CLASSIFICATION
SCHEME

The scheme is a parallel process for
topographic classification of every pixel which can
be done in one pass through the image. At each
pixel of the image, the following four steps need
to be performed:

1. Calculate the fitting coefficients, $k_1$
through $k_{10}$, of a two-dimensional cubic
polynomial in an n-by-n neighborhood
around the pixel. These coefficients are
easily computed by convolving the
appropriate masks over the image.

2. Use the coefficients calculated in step 1
to find the gradient, gradient magnitude,
and the eigenvalues and eigenvectors of
the Hessian at the center of the pixel's
neighborhood, (0,0).

3. Search in the direction of the
eigenvectors calculated in step 2 for a
zero-crossing of the first directional
derivative within the pixel's area. (If
the eigenvalues of the Hessian are equal
and nonzero, then search in the Newton
direction.)

4. Recompute the gradient, gradient
magnitude, and values of second
directional derivative extrema at each
zero-crossing. Then apply the labeling
scheme as described in Sections
3.1 - 3.4.

6. RESULTS

In this section, we show the results of the
topographic classification on some digital terrain
imagery and aerial photographs used in the Passive
Image Navigation Study.

6.1 Results on Digital Terrain Data

In Figure 1 we show the results of the
topographic classification algorithm on digital
terrain data which represents a roughly 4x17 mile
strip of land and ocean just east of Monterey,
California. The actual image resolution is 121x512
pixels. The figure shows the labeling for several
of the categories: the topmost panel shows the
ravines in white, the next shows the ridges, and
the third shows the peaks. The original gray-level
picture is shown at the bottom.

The algorithm shows excellent results on the
digital terrain data. The ridges and ravines
appear to be robust enough for use in a reference
topographic landmark database. Sensed topography
would be matched against the reference database for
navigation purposes.

6.2 Results on Aerial Photographs

In Figures 2 and 3 we show the results of the
classifier on a set of aerial photographs. It
seems evident that ravines, ridges, and hillsides
(slopes) could serve as reference data in an
intensity landmark database. Exactly which
topographic categories are reliable and how they
should be linked together and pruned remains a
topic of future research.

7. CONCLUSIONS


In this paper, we have given a precise
mathematical description of the various topographic
structures that occur in a digital image and
have called the classified image the topographic
primal sketch. Our set of topographic categories
is invariant under monotonically increasing
gray tone transformations and consists of (peak,
pit, ridge, ravine, saddle, flat, and hillside),
with hillside being broken down further into the
subcategories inflection point, slope, convex hill,
concave hill, and saddle hill. The hillside
subcategories are not invariant under the monotonic
transformations.

The topographic label assigned a pixel is
based on the pixel's first- and second-order
directional derivatives. We use a two-dimensional
cubic polynomial fit based on the local facet model
to estimate the directional derivatives of the
underlying gray tone intensity surface. The
calculation of the extrema of the second
directional derivative can be done efficiently and
stably by forming the Hessian matrix and
calculating its eigenvalues and their associated
eigenvectors. Strict, local, one-dimensional
extrema (such as pit, peak, ridge, ravine, and
saddle) are found by searching for a zero-crossing
of the first directional derivative in the
directions of extreme second directional derivative
(the eigenvectors of the Hessian). We have also
identified another direction of interest, the
Newton direction, which points toward the extremum
of a quadratic surface. The classification scheme
was found to give satisfactory results on a number
of test images.

7.1. Directions for Further Research

Further  research  on  the  topographic  primal 
sketch  needs  to  be  done  to  (1)  develop  better  basis 
functions,  (2)  make  use  of  fitting  error,  (3)  find 
a  solution  for  the  ridge  (ravine)  continuum 
problem,  and  (4)  develop  techniques  for  grouping  of 
the topographic structures. Basis functions worth
considering  include  trigonometric  polynomials, 
polynomials  of  higher  order,  and  piecewise 
polynomials  of  lower  order  than  cubic.  The  basis 
functions  problem  is  to  find  a  set  of  basis 
functions  and  an  associated  inner  product  for 
least-squares  approximation  that  can  correctly 
replicate  all  common  image  surface  features  and  be 
simultaneously  computationally  efficient  and 
numerically  stable.  Fitting  error  needs  to  be  used 
in  deciding  into  which  class  a  pixel  falls.  Noise 
causes  the  fitting  error  to  increase,  and  increased 
fitting  error  increases  the  uncertainty  of  the 
labeling.  Also,  global  knowledge  of  how  the 
topographic  structures  fit  together  could  be  used 
to correct the misclassification error caused by
noise.  The  way  the  neighborhood  size  affects  the 
surface  fitting  error  and  the  classification  scheme 
needs  to  be  investigated  in  detail. 

The ridge (ravine) continuum problem needs to
be solved. It may be that there is no way to
distinguish between a true ridge and a ridge
continuum using only the values of partial
derivatives at a point. The solution may require
complete use of the partial derivatives in a local
area about the pixel.

Most important for the use of the primal
sketch in a general robotics computer vision system
is the development of techniques for grouping and
assembling topographically labeled pixels to form
the primitive structures involved in higher-level
matching and correspondence processes. How well can
stereo correspondence or frame-to-frame
time-varying image correspondence tasks be accomplished
using the primitive structures in the topographic
primal sketch? How effectively can the topographic
sketch be used in undoing the confounding effects
of shading and shadowing? How well will the
primitive  structures  in  the  topographic  sketch 
perform  in  the  two-dimensional  to  three-dimensional 
object matching process?

The following steps are performed in our approach to monocular
reconstruction. First, linear connected structures in the image that are
meant to represent building boundaries are formed. The structures are
obtained by first extracting junctions from the image, many of which
arise from building corners. To obtain lines corresponding to building
edges, we hypothesize connections between the junctions. The 2D
structures are formed by linking the junctions and hypothesized lines.

These 2D structures are then converted into 3D wire frames using, in
addition to the task-specific assumptions mentioned above, the following
two general assumptions: lines in the image directed toward the vertical
vanishing point are vertical in 3-space, and two lines that are aligned in
the image are also aligned in 3-space.

Finally, task-specific knowledge is used to convert the 3D wire-frame
description into a surface-based description, or scene model. Examples of
the scene model are shown in Figs. 15 and 16.

4.  extracting  Linoa  and  Junctions 

Ihe  method  we  use  for  extracting  lines  and  junctions  is  die  same  as 
that  used  during  stereo  analysis  in  the  31)  Mosaic  system  (3,  5],  where  a 
junction-based  stereo  matching  approach  is  used.  The  following  is  a  brief 
review  of  Lilts  method. 

The first step is to extract linear segments. A 3x3 Sobel operator is
used to detect edge points. These are then thinned using a modified
Nevatia and Babu algorithm [7], as shown in Fig. 2. The resulting edge
points are linked and approximated by piecewise linear segments using
the iterative end-point fitting algorithm [2, 7]. Finally, short lines are
discarded. The resulting line image corresponding to Fig. 1 is shown in
Fig. 3.
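The line-fitting step can be illustrated with a minimal Python sketch of iterative end-point fitting, assuming the usual recursive form of the algorithm: a chain of linked edge points is split at the point farthest from the chord joining its ends until every point lies within a tolerance. The function names, the `tolerance` parameter, and the `min_length` parameter are illustrative assumptions and are not taken from the system described here.

```python
import math

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length

def endpoint_fit(points, tolerance):
    """Return the break points of a piecewise linear fit to `points`."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord joining the end points.
    distances = [point_line_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    worst = max(range(len(distances)), key=distances.__getitem__) + 1
    if distances[worst - 1] <= tolerance:
        return [points[0], points[-1]]
    # Split at the farthest point and fit each half separately.
    left = endpoint_fit(points[:worst + 1], tolerance)
    right = endpoint_fit(points[worst:], tolerance)
    return left[:-1] + right

def drop_short(breaks, min_length):
    """Turn break points into segments, discarding the short ones."""
    segments = list(zip(breaks, breaks[1:]))
    return [(a, b) for a, b in segments if math.dist(a, b) >= min_length]
```

For an L-shaped chain of edge points, `endpoint_fit` returns the two end points plus the corner, and `drop_short` then keeps only the segments that reach the chosen minimum length.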

The next step is to extract junctions from the line image. A junction is
a group of line segments (legs) in the image that meet at a point. We
consider the following four junction types: L, ARROW, FORK, and
T. To find junctions, a 5x5 window around each end point of each line is
searched for ends of other lines. If the window contains the ends of three
lines, they are classified as an ARROW, FORK, or T junction depending
on the angles between the lines. If the window contains the ends of two
lines, they are classified as an L junction. If the window contains more
than three lines, each set of two lines is assumed to form a distinct L
junction. Junctions that have been found in this manner are labeled in
Fig. 3. Notice that many of the junctions correspond to building corners.
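The classification rules can be sketched in Python as follows. The text says only that the three-leg types depend on the angles between the lines, so the angle criteria below are an assumption based on the conventional definitions: a T has two roughly collinear legs, an ARROW has one angular gap wider than 180 degrees, and a FORK has every gap below 180 degrees. The tolerance `straight_tol_deg` and both function names are hypothetical.

```python
from itertools import combinations

def classify_junction(leg_angles_deg, straight_tol_deg=10.0):
    """Classify a junction from the directions of its legs (degrees)."""
    n = len(leg_angles_deg)
    if n == 2:
        return "L"
    if n != 3:
        raise ValueError("expected two or three legs")
    angles = sorted(a % 360.0 for a in leg_angles_deg)
    # Angular gaps between neighbouring legs, going once around the junction.
    gaps = [(angles[(i + 1) % 3] - angles[i]) % 360.0 for i in range(3)]
    widest = max(gaps)
    if abs(widest - 180.0) <= straight_tol_deg:
        return "T"        # two legs are roughly collinear
    if widest > 180.0:
        return "ARROW"    # all three legs fit inside a half plane
    return "FORK"         # every gap is below 180 degrees

def junctions_in_window(leg_angles_deg):
    """Junction types produced by the line ends found in one window."""
    n = len(leg_angles_deg)
    if n < 2:
        return []
    if n <= 3:
        return [classify_junction(leg_angles_deg)]
    # More than three line ends: every pair forms a distinct L junction.
    return ["L" for _ in combinations(leg_angles_deg, 2)]
```

With four line ends in a window, `junctions_in_window` yields six L junctions, one for each pair of lines.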

5.  Locating  2D  Structures 

After lines and junctions are extracted, connected structures are
formed by hypothesizing new lines to connect junctions. These lines are
meant to correspond to building edges. Two steps are used in the process
of hypothesizing connecting lines. First, two junctions may be connected
only if a leg of one points at the other, i.e., the extended leg meets the
other junction. Second, the two junctions must appear to be connected
by line segments in the line image.

The first step involves finding all pairs of junctions such that one has a
leg pointing at the other, and proceeds as follows. First, if two junctions
share the same leg, they are connected. Next, for each leg of each
junction $J_i$, a thin rectangular window is located in the direction along
the leg (Fig. 4). Of the junctions within this window and within an angle
$\alpha$ from the direction of the leg, the one closest to $J_i$ is retained as a
candidate for being connected to $J_i$. Fig. 5 shows a graph with all
candidate connections drawn.

Only the connections in Fig. 5 that appear as connections in the line
image (Fig. 3) are retained. The following procedure is used to determine
this. For each pair of connected junctions $J_i$ and $J_k$, we find all segments
in the line image that are contained within a thin rectangular window
connecting $J_i$ and $J_k$ (Fig. 6), and project these segments onto the line
connecting the two junctions. Next we consider how much of this line is
covered by projected segments. The connection between $J_i$ and $J_k$ is
retained only if the percentage of coverage exceeds a threshold.
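A minimal Python sketch of this coverage test is given below, assuming that segments are pairs of (x, y) end points and that the thin window is described by a half-width across the connecting line. The function name `coverage_fraction` and the parameter `half_width` are illustrative and do not come from the original system.

```python
import math

def coverage_fraction(j1, j2, segments, half_width):
    """Fraction of the line j1-j2 covered by projections of nearby segments."""
    (x1, y1), (x2, y2) = j1, j2
    length = math.hypot(x2 - x1, y2 - y1)
    ux, uy = (x2 - x1) / length, (y2 - y1) / length

    def local(p):
        # Coordinates of p along and across the line from j1 to j2.
        dx, dy = p[0] - x1, p[1] - y1
        return dx * ux + dy * uy, -dx * uy + dy * ux

    intervals = []
    for a, b in segments:
        (sa, ta), (sb, tb) = local(a), local(b)
        inside = (all(abs(t) <= half_width for t in (ta, tb))
                  and all(0.0 <= s <= length for s in (sa, sb)))
        if inside:
            intervals.append((min(sa, sb), max(sa, sb)))

    # Measure the union of the projected intervals.
    covered, reach = 0.0, 0.0
    for lo, hi in sorted(intervals):
        if hi > reach:
            covered += hi - max(lo, reach)
            reach = hi
    return covered / length
```

A hypothesized connection would then be kept when `coverage_fraction(...)` exceeds the chosen threshold; overlapping projections are counted only once because the union of the intervals is measured.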

The result of this pruning step is shown in Fig. 7. Note that it does a
good job in eliminating unwanted connections. We also tried another
method to perform the pruning; it involved applying the Hough
transform to a thin rectangular region in the edge image (Fig. 2) between
each connected pair of junctions to determine whether a line connecting
the junctions could be found. The results of this method were not as
good, probably because the high number of texture edge points
suggested too many lines.

At this point in the processing, junctions have two kinds of legs, those
extracted in the junction finding step and those hypothesized as
connections between junctions. Some of these legs are extraneous and
are eliminated as follows. For each connected pair of junctions, if one
has a leg that points at the other, the leg is deleted, for it is replaced by
the hypothesized leg. Next, we utilize the assumptions that all vertices in
the scene are trihedral and have one vertical and two horizontal legs, and
that lines in the image directed toward the vertical vanishing point are
vertical in the scene. First each junction leg is labeled "vertical" or
"horizontal" depending on whether or not it is directed toward the
vertical vanishing point. Then, if a junction has more than one "vertical"
or two "horizontal" legs, the extra ones are deleted according to a
priority scheme that rates hypothesized legs higher than originally
extracted ones, and rates hypothesized legs connecting two junctions that
were originally mutually pointing higher than legs connecting junctions
where only one was originally pointing at the other. The extra legs with
the lowest priorities are deleted. The resulting legs are shown in Fig. 8.
At this point, we have 2D connected structures representing portions of
building boundaries.
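The labeling and pruning of legs can be sketched in Python as below. This is only an illustration under stated assumptions: a leg counts as directed toward the vertical vanishing point when the angle between its line and the line to the vanishing point is within a tolerance `tol_deg` (a hypothetical parameter), and the three priority levels are encoded with made-up string tags.

```python
import math

# Higher value means higher priority, following the scheme in the text.
PRIORITY = {"extracted": 0, "hypothesized_one_way": 1, "hypothesized_mutual": 2}

def label_leg(junction, leg_end, vanishing_point, tol_deg=5.0):
    """Label a leg "vertical" if its line points at the vanishing point."""
    lx, ly = leg_end[0] - junction[0], leg_end[1] - junction[1]
    vx, vy = vanishing_point[0] - junction[0], vanishing_point[1] - junction[1]
    # Angle between the leg's line and the line to the vanishing point;
    # the absolute value makes the test independent of the leg's sense.
    cos_angle = abs(lx * vx + ly * vy) / (math.hypot(lx, ly) * math.hypot(vx, vy))
    angle = math.degrees(math.acos(min(1.0, cos_angle)))
    return "vertical" if angle <= tol_deg else "horizontal"

def prune_legs(legs):
    """Keep at most one vertical and two horizontal legs per junction.

    `legs` is a list of (label, origin) pairs, where origin is a PRIORITY key.
    """
    limits = {"vertical": 1, "horizontal": 2}
    kept = []
    for label, limit in limits.items():
        group = [leg for leg in legs if leg[0] == label]
        group.sort(key=lambda leg: PRIORITY[leg[1]], reverse=True)
        kept.extend(group[:limit])   # the lowest priorities are dropped
    return kept
```

The sense-independent angle test reflects the fact that a "vertical" leg may run either toward or away from the vanishing point, depending on which of its two junctions is considered.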

6.  Obtaining  3D  Wire  Frames 

In order to obtain 3D information from the 2D structures in Fig. 8,
we assume that the vertical vanishing point is known, that the camera
focal length is known, that all lines that are part of the 2D structures
(Fig. 8) arise from either vertical or horizontal scene edges, and that the
lines can be labeled as such according to whether or not they are directed
toward the vertical vanishing point. First, we calculate the vector from the
focal point to the vertical vanishing point. This results in a 3-space vector
in the vertical direction [1], which will be very useful for our processing.

Suppose we want to recover the 3D configuration of the junction
$p_1p_2p_3p_4$ in Fig. 9 under the assumptions outlined above. Suppose also
that line $p_2p_4$ has been labeled "vertical" and lines $p_1p_2$ and $p_2p_3$ have
been labeled "horizontal". Let $f$ be the focal length, and let $u$ be the unit
vector in the vertical direction. The vector $u$ is normal to all horizontal
planes. First we would like to determine the 3-space position of $v_2$,
corresponding to the junction point $p_2$. Since it is impossible to
determine the actual position of this point from a single image without
special information, the position is determined as some arbitrary point
lying on the ray through $p_2$. If the focal center is the origin of the
coordinate system and $p_2 = (x_2', y_2', -f)$ and $v_2 = (x_2, y_2, z_2)$, then

$$v_2 = a \, \frac{p_2}{\lVert p_2 \rVert}$$

where $a$ is the arbitrary distance from $v_2$ to the focal point.

The equation of the horizontal plane $v_1v_2v_3$ can now be established as

$$r \cdot u = v_2 \cdot u$$

where $u$ is the normal to the plane and $r$ is any point contained by the
plane. The 3-space positions of the points $v_1$ and $v_3$ can then be
computed as the intersections of this plane with the rays through $p_1$ and
$p_3$, respectively, i.e.,

$$v_1 = \frac{v_2 \cdot u}{p_1 \cdot u} \, p_1, \qquad v_3 = \frac{v_2 \cdot u}{p_3 \cdot u} \, p_3 .$$

Finally, the 3-space position of the point $v_4$ is computed as the
intersection of the ray through $p_4$ with the line through $v_2$ along the
vector $u$:

$$v_4 = s \, p_4 = v_2 + t \, u \quad \text{for scalars } s \text{ and } t .$$

Although this technique permits us to recover the 3D configuration of
any junction relative to some arbitrary depth, it is not useful to apply it
directly to the junctions in the original line image (Fig. 3) because the
relative heights above the ground plane of the corresponding vertices
cannot be determined: the height of each vertex is arbitrarily chosen
without relation to the heights of other vertices. It is more useful,
however, to apply the technique to the 2D structures in Fig. 8, since the
heights of the vertices within each structure can be related. To see how
this is done, consider the example in Fig. 10, which shows a 2D structure.
The solid lines are part of the extracted structure, while the dashed lines
are for the reader's convenience to make the 3D shape more apparent.
Suppose two of the solid lines have been labeled "vertical", while the other
solid lines have been labeled "horizontal". Applying our technique to
(say) one junction point, the 3-space positions of the vertices
corresponding to the points connected to it can be determined relative to
some arbitrary depth $a$ for that point. If the technique is applied next to
one of these connected points, the 3-space positions of its own neighbors
can be determined as a function of the depth $a$. This procedure
continues from point to point until the 3D configuration of the
whole structure has been determined, relative to some arbitrary depth.

In order to obtain a coherent scene description, the depths of the
different structures in the scene must be related. We use two methods to
do this. The first method involves finding structures that lie on the
ground plane. Suppose a junction point $p$ of such a structure is
hypothesized to arise from a vertex lying on the ground. Then the
3-space position $r$ of the vertex is

$$r = \frac{d}{p \cdot u} \, p$$

where $u$ is the unit vector in the vertical direction (thus normal to the
ground plane) and $d$ is an arbitrarily chosen distance from the origin to
the ground plane. Since the 3-space position of all junctions arising from
ground points can be calculated in this manner, the depths of all
structures containing such points can be related to one another through
the parameter $d$.
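The ray and plane intersections used in this section can be illustrated with a small Python sketch. It assumes the focal center is the origin, image points are 3-vectors of the form (x', y', -f), and `u` is the unit vertical vector. All function names are hypothetical. For the vertical line, the sketch returns the point on the ray closest to the line, which coincides with the intersection whenever the two actually meet; this closed form is a choice made here and is not necessarily the formula used in the original system.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scale(s, a):
    return tuple(s * x for x in a)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def point_at_distance(p, a):
    """Point on the ray through image point p at distance a from the origin."""
    return scale(a / math.sqrt(dot(p, p)), p)

def ray_plane(p, u, offset):
    """Intersection of the ray through p with the plane r . u = offset."""
    return scale(offset / dot(p, u), p)

def ray_vertical_line(p, v, u):
    """Point on the ray through p closest to the line through v along u."""
    pu, vu = cross(p, u), cross(v, u)
    return scale(dot(vu, pu) / dot(pu, pu), p)

def recover_junction(p1, p2, p3, p4, u, a):
    """3D vertices for a junction with vertical leg p2p4 and horizontal legs."""
    v2 = point_at_distance(p2, a)
    offset = dot(v2, u)                 # horizontal plane through v2
    v1 = ray_plane(p1, u, offset)
    v3 = ray_plane(p3, u, offset)
    v4 = ray_vertical_line(p4, v2, u)
    return v1, v2, v3, v4
```

A vertex hypothesized to lie on the ground can be placed with the same helper, as `ray_plane(p, u, d)` for the chosen ground-plane distance `d`.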

To hypothesize junctions in the 2D structures (Fig. 8) that arise from
vertices lying on the ground plane, we use the observation that if a line
labeled "vertical" connects two junctions, the line is directed toward the
vertical vanishing point with respect to one junction, but away from this
vanishing point with respect to the other junction. The latter junction is
assumed to represent a vertex lying on the ground plane. Two of the
junction points in Fig. 10 are examples of such junctions. The 3-space
positions of these junctions are then calculated, and their values are
propagated throughout their structures as described previously. Fig. 11
depicts a perspective view of the 3D wire frames obtained in this manner.

There are many structures in Fig. 8 that do not contain points lying on
the ground plane, either because such points are occluded in the scene or
because they have not been properly extracted from the image.
Nevertheless, the heights of some of these structures can be hypothesized
using the rule that if two lines are aligned in the image, assume they are
also aligned in 3-space [6]. To see how this rule is used, consider Fig. 12.
Suppose that points $p_1$ through $p_7$ have already been assigned 3D
coordinates, and we want to obtain the 3-space position of the 2D
structure $p_8p_9p_{10}p_{11}$. Since the lines $p_6p_7$ and $p_8p_{11}$ are aligned in the
image and both are labeled "horizontal", they are assumed to be aligned in
the scene and to lie in the same horizontal plane. The equation of this plane
is

$$r \cdot u = v_7 \cdot u$$

where $v_7$ is the 3-space point corresponding to $p_7$ and $u$ is the unit
vector in the vertical direction. The 3-space position of the point $p_8$ is
therefore determined as the intersection of this plane with the ray
through $p_8$, or

$$v_8 = \frac{v_7 \cdot u}{p_8 \cdot u} \, p_8 .$$

The 3D coordinates of this point may then be propagated to points $p_9$,
$p_{10}$, and $p_{11}$ as described previously. Note that all 3D positions are
functions of the parameter $d$, which is arbitrarily chosen for the equation
of the ground plane.

The following tests are used to determine whether two lines in the
image, $l_1$ and $l_2$, are aligned (Fig. 13):

1. They must be almost parallel, i.e., the smallest angle between
them must be less than the threshold angle $\beta$.

2. They cannot be displaced laterally by too much, i.e., $l_2$ must
lie totally within a band of threshold thickness $W$
surrounding $l_1$.

3. They cannot be too far from each other, i.e., the ratio of the
gap $g$ between the two lines to the average length of the lines
must be less than a threshold. Although this condition is not
required for strict alignment, we include it so that the
alignment rule will not be applied to two lines whose distance
from one another is large compared to their lengths.

4. They must be separated from each other, i.e., the projection
of $l_2$ onto $l_1$ cannot overlap $l_1$. Again, this condition,
although not required for strict alignment, is included so as
not to apply the alignment rule to two lines that are not
separated enough.
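The four tests can be combined in a short Python sketch. Several details are assumptions made for illustration, since Fig. 13 is not reproduced here: the band of thickness W is taken to be centred on the first line, and the gap g is measured along the first line between its nearer end and the projection of the second line. The function and parameter names are hypothetical.

```python
import math

def aligned(l1, l2, beta_deg, band_width, max_gap_ratio):
    """Apply the four alignment tests to segments l1 and l2."""
    (ax, ay), (bx, by) = l1
    (cx, cy), (dx, dy) = l2
    len1 = math.hypot(bx - ax, by - ay)
    len2 = math.hypot(dx - cx, dy - cy)
    ux, uy = (bx - ax) / len1, (by - ay) / len1

    # Test 1: the smallest angle between the two lines is below beta.
    cos_angle = abs((dx - cx) * ux + (dy - cy) * uy) / len2
    if math.degrees(math.acos(min(1.0, cos_angle))) >= beta_deg:
        return False

    # Coordinates of l2's end points along (s) and across (t) l1.
    s = [(x - ax) * ux + (y - ay) * uy for x, y in l2]
    t = [-(x - ax) * uy + (y - ay) * ux for x, y in l2]

    # Test 2: l2 lies entirely within the band around l1.
    if any(abs(v) > band_width / 2.0 for v in t):
        return False

    # Test 4: the projection of l2 onto l1 must not overlap l1.
    gap = max(min(s) - len1, -max(s))
    if gap <= 0.0:
        return False

    # Test 3: the gap is small relative to the average line length.
    return gap / ((len1 + len2) / 2.0) < max_gap_ratio
```

Test 4 is evaluated before test 3 here only because both rely on the same gap value; the outcome does not depend on the order.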


Fig. 14 depicts a perspective view of the final 3D wire frames obtained
using both the methods of hypothesizing points on the ground plane and
applying the alignment rule.

7. Generating the 3D Scene Model

In order to obtain a 3D scene model, we use a geometric modelling
component that converts a wire-frame description of a scene into a
surface-based description. This geometric modelling component is also
used during stereo reconstruction in the 3D Mosaic system, since the
output of the stereo analysis is a wire-frame description [3, 5, 4]. The
following is a brief review of the geometric modelling system.

To generate the surface-based description from the wire frames, it is
assumed that the objects in the scene can be approximated by polyhedra,
that each face is a parallelogram unless there is contrary evidence, that
the position of the ground plane is known, and that the objects have
walls that are perpendicular to the ground plane. (It is not assumed that
the roofs are parallel to the ground.)

The processing proceeds as follows. First, wire-frame edges that are
nearly parallel and very close to each other are merged. Next, web faces
that correspond to corners of planar faces are generated for each vertex
corner. The web faces that represent corners of a single face are then
merged. After all mergers, faces that do not have closed boundaries are
completed either as parallelograms or as other closed polygons. After all
faces are completed, those that seem to be holes in other faces are
converted to holes. Finally, objects that are not closed are completed by
dropping vertical walls from the roofs toward the ground plane.

Perspective views of the resulting scene model are shown in Fig. 15.

In order to render more realistic displays, gray scale obtained from the
original image (Fig. 1) is added to them (Fig. 16).

8. Conclusion

A system has been described that reconstructs the three-dimensional
shape of a complex urban scene from a single image. A 3D wire-frame
description of the scene, meant to represent portions of building
boundaries, is generated first. The wire frames are then converted into an
approximate surface-based 3D model of the scene. We have also
demonstrated that task-specific knowledge is very useful for interpreting
complex images. Such knowledge is used for generating 2D image
structures, 3D wire frames, and the 3D scene model.

There  are  several  extensions  and  improvements  we  have  in  mind  for 
the  methods  and  techniques  described  in  this  paper: 

1. The 2D structures are obtained by hypothesizing line
connections between extracted junctions. More complete [...]
useful for hypothesizing such lines.
[...] and texture, which should be incorporated into our
monocular analysis.
Figure 4: The closest junction to $J_i$ within the thin rectangular
window of the given length and height, and within the angle $2\alpha$,
is a candidate for being connected to $J_i$.

Figure 5: Each line represents a possible connection between the junctions
at its two end points. Each end point corresponds to a junction in Fig. 3.

Figure 10: The solid lines represent a connected
2D structure. The dashed lines are for
the reader's convenience to make the
3D shape more apparent.

Figure 8: Result of adding to Fig. 7 the junction legs that
were originally extracted in the junction finding step,
and then deleting extraneous legs.

Figure 9: The 3D configuration of the junction $p_1p_2p_3p_4$ can be
recovered under assumptions explained in the text.
O is the focal center and the origin of the coordinate system.

Figure 11: Perspective view of 3D wire frames generated
from Fig. 8 using the method of finding junctions
arising from vertices lying on the ground plane.
PROGRESS IN STEREO MAPPING

H. Harlyn Baker, Thomas O. Binford, Jitendra Malik, Jean-Frederic Meller
Artificial  Intelligence  Laboratory,  Computer  Science  Department, 

1: Introduction

Rather than a description of a single piece of research, this is more in
the line of a collective report on some areas we've been addressing in
our research in stereo mapping. We have been developing tools and
experimenting with matching strategies that will build to a rule-based
stereo matching system. In particular, we have been:

a) demonstrating the design and the utility of the rule-based
approach to surface inference from monocular information
through the hand synthesis of matching strategies;

b) developing tools to support a mapping interactive test facility;

c) experimenting with the [Baker 1981] stereo mapping system,
preparing to run it on some new imagery.

In a), we have been carrying out research aimed at the analysis and
synthesis of rules for inference of three-dimensional shape from single
images. We have been addressing inference of matching rules and
the use of model-based analysis both with theoretical analyses and
with hand and automated analyses of specific matching strategies, the
latter applied to both real and synthesized imagery examples. This
inference also has application in constraining search for matches in
stereo correspondence.

Our  work  in  developing  tools  has  centered  on: 

a)  a  system  for  the  hand  construction  of  edge  descriptions  from 
hard  copy  imagery; 

b) an interactive system for determining the transform to bring
image pairs into collinear epipolar registration.

Both of these tools make extensive use of interactive graphics, and
the latter takes much advantage of previous stereo research from our
laboratory ([Gennery 1980]).

In c), we have been undertaking to apply the [Baker 1981] system to
some new imagery. In this we hope to demonstrate its effectiveness,
to expose its limitations, and to suggest both its role in an advanced
mapping system and complementary research needed to improve its
utility. Significant restructuring of the system was called for in enabling
it to process this new imagery. Details of these changes are described
in section 4, which deals with the matching process. Modifications have
now been implemented, enabling the system to:

• function on the output of an improved edge operator
[Marimont 1982];

• use edge extent as one of its parameters in seeking
optimized correspondence;

• exploit prepared transform information in processing
images whose epipolar lines are not collinear with the
scanning axes of the cameras.

We will describe results in these areas of the research through discussion
of the following:

a) the use of image edge descriptions produced using the digitizing
facility and from an automated process [Marimont 1982]
in synthesizing rules for stereo matching;

b) development of a system for epipolar registration of image
pairs;

c)  modifications  to  the  [Baker  1981]  stereo  system  (results  later). 


2:  Inference  and  Modelling 

2.1 Preamble - the digitizing facility

Our approach to rule development begins with hand synthesized and
some automatically generated edge data. We have systems for the
automatic generation of edge data (i.e. [Marimont 1982]). This data,
extracted from a sufficiently wide selection of imagery types, gives good
insight into the current capabilities of automated processes. Automated
processes, however, are not able presently to give as meaningful a
description of an image as we would like, and have not been designed
to provide the aggregated abstractions research systems ([Lowe 1982])
will be soon supplying. To bridge this inadequacy, we work with both
automatically generated data (the current state-of-the-art), and hand
generated data (representative of the next generation of edge analysis
processes). The hand generated data is obtained from a manually
operated digitizing tablet. We have written a graphics-based digitizing
and editing system to run with a GTCO tablet in producing these
image descriptions.

Figure  2- 1  below  shows  an  image  pair  of  a  building  complex  (referred 
to  as  the  Sacramento  imagery).  Figure  2-2  shows  the  results  of  tablet 
edge  extraction  on  these  images.  Figure  2-3  shows  the  results  of  the 
Marimont  operator  [Marimont  1982]  on  the  image  pair.  Manually 
generated  edge  data  was  produced  using  this  facility  for  the  analysis 
of  rule  synthesis  of  section  2.  It  was  also  used  to  digitize  the  building 
data of figure 2-1 for input to the OTV inference process, as figure
2-5. 

2.2  Data  for  Rule  Synthesis 

We have obtained extended edge data from hand and automated
processing for use in synthesis of matching rules. Results from earlier
work on OTV analysis (orthogonal trihedral vertices) have been
exploited [Perkins 1968] for rule formation in shape inference and in
constraining search for correspondence. We have taken examples from
the modelling of generic structures to produce ground and aerial views
of a building complex, and have used this, as well as other data, in rule
synthesis.

2.3  Modelling  and  Vision 

2.3.1  -  Modelling,  prediction  and  interpretation 

Of  course,  one  of  the  primary  goals  of  research  in  computer  vision  is 
the  development  of  systems  that  can  recognize  and  locate  objects  in 
images.  In  order  to  identify  such  an  object,  it  is  clearly  necessary  to 
have  some  description  of  its  characteristics  that  can  be  detected  in  an 
image. A representation of an object is the form this description takes.

One approach to representation is to provide the system with three-
dimensional models of objects. Rotation of these models will allow
objects to be observed, conceptually, from differing viewpoints. If
parameters in a particular model are allowed to vary it is possible to
have that single model represent a whole class of objects; constraining
the parameters functions to delimit sub-classes. Further model
manipulations, such as partitioning and projection, can be used to aid
in mapping model to imagery data. The information contained in such
object models may be used to determine possible interpretations of
image features (e.g. edges, ribbons, corners) and to provide feedback
to predict the locations of such features in an image.

ACRONYM [Brooks 1981] is a three-dimensional rule-based modelling/
vision system developed here at Stanford that provides, among other
things, such feature prediction, model manipulation, and image
interpretation. The rule-base operates on the models and on the sensed
data to accomplish scene interpretation. Such a rule-based approach
has been shown to be an effective form for constraint and search
implementation, and allows easy modification and addition of new rules
without the need of altering the underlying code.

Sacramento Imagery
Figure 2-1

Our group's intention over the next few years is to build a rule-based
stereo system operating within ACRONYM whose functioning will
include model-based prediction. Working toward this, we have been
carrying out experiments on scene inference and model-based prediction
that will lead to a repertoire of stereo matching rules.

2.3.2 - Models and stereo matching

One of the major difficulties in determining stereo correspondence is in
dealing with the large number of matches that are possible. Solution
is generally found by search through a large parameter space, where
possible correspondences are limited by geometric or photometric
constraints. Search can be reduced even more dramatically by endowing
the matcher with broad domain specific and domain independent
knowledge. Such knowledge can be rule-based and model-based. Our
proposition here is that the three-dimensional information in object
models, along with inference and prediction mechanisms, can be used
to interpret features in image pairs. These interpretations can then be
used as filters to constrain the matching. We demonstrate this notion
with the example of Orthogonal Trihedral Vertices, often referred to as
cube corners or OTVs. Other rules synthesized from analysis of both
manually extracted and automated edge processes follow.


The work on cube corners points to additional usefulness for a model-
based approach. OTV orientation analysis (from matches across pairs
of views) yields almost complete solution for camera parameters;
constraints on sizes (again, from rules and models) could complete
the camera solution. But the orientation information yielded by a
match of a pair of vertices is valid only if the vertex is a cube corner;
thus it is necessary to be able to distinguish between vertices that are
cube corners and those that are not. If the models contain sufficient
information to identify cube corners, then the problem of determining
cube corners independently of the identification process is eliminated.
In fact, both the search for cube corners and the search for identification
are likely to be reduced when they are combined.

2.3.3 - OTV rule-based analysis

OTV  theory 

In cultural scenes, we find a large number of interior and exterior
corners of cubes, typically when two walls at right angles meet the roof
or the floor. The importance of utilizing this common structural
element, the Orthogonal Trihedral Vertex (OTV), has been emphasized
earlier [Liebes 1981]. Since they provide a very tight constraint (the
three edges are mutually orthogonal in space), it is possible to calculate
the three-dimensional orientations given the projections in the image.

This can be done for both orthographic and perspective viewing.

If the eye is assumed to be focussed on the vertex of the cube corner,
perspective can be ignored and the projection of a cube corner in
XYZ space will simply be its orthogonal projection on the XY plane.
Suppose that some 3-star has angles between its rays a, b and c and
also that the rays are represented by the unit vectors $v_1$, $v_2$, $v_3$. We are
interested in detecting whether there are three vectors in XYZ space,
which are mutually orthogonal and project, respectively, to $v_1$, $v_2$, $v_3$.

Since projection is accomplished by dropping the z component, any 3
such vectors must be of the form $v_1 + \lambda_1 \hat{z}$, $v_2 + \lambda_2 \hat{z}$, and $v_3 + \lambda_3 \hat{z}$, where
$\hat{z}$ is the unit vector in the z-direction.

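Requiring mutual orthogonality implies that the dot products of these
vectors in pairs be zero. Taking a, b and c to be the angles between the
pairs $(v_1, v_2)$, $(v_2, v_3)$ and $(v_1, v_3)$ respectively, these conditions read
$\lambda_1\lambda_2 = -\cos a$, $\lambda_2\lambda_3 = -\cos b$ and $\lambda_1\lambda_3 = -\cos c$. From these
conditions and some simple manipulations we can calculate the formulas
for the $\lambda_i$:

$$\lambda_1 = \pm\sqrt{-\frac{\cos a \, \cos c}{\cos b}}, \qquad
\lambda_2 = \pm\sqrt{-\frac{\cos a \, \cos b}{\cos c}}, \qquad
\lambda_3 = \pm\sqrt{-\frac{\cos b \, \cos c}{\cos a}}$$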

Hence solutions exist if

a) cos a, cos b, cos c are all non-zero and

b) either one or three of cos a, cos b, cos c are negative, so that
the quantities under the square root sign are positive.

These results were first derived in [Perkins 1968].

Thus we have a way of both eliminating false candidates for being
OTVs and finding the 3-D orientations of valid OTVs. This algorithm
has been implemented and run on data from the digitizing tablet.
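The existence test and the orientation recovery can be sketched in Python as below. This is a minimal illustration and not the implementation referred to above. It assumes that a, b and c are the angles between the ray pairs (v1, v2), (v2, v3) and (v1, v3), so that the z components satisfy l1*l2 = -cos a, l2*l3 = -cos b and l1*l3 = -cos c. The function name `otv_lifts` and the tolerance `eps` are hypothetical.

```python
import math

def otv_lifts(a_deg, b_deg, c_deg, eps=1e-12):
    """Return the z components (l1, l2, l3), or None if no OTV fits.

    The solution is determined up to a common sign; l1 is chosen positive.
    """
    ca, cb, cc = (math.cos(math.radians(x)) for x in (a_deg, b_deg, c_deg))
    # Condition a): all three cosines must be non-zero.
    if min(abs(ca), abs(cb), abs(cc)) < eps:
        return None
    # Condition b): exactly one or three of the cosines must be negative.
    negatives = sum(x < 0.0 for x in (ca, cb, cc))
    if negatives not in (1, 3):
        return None
    l1 = math.sqrt(-ca * cc / cb)
    l2 = math.sqrt(-ca * cb / cc)
    l3 = math.sqrt(-cb * cc / ca)
    # Choose signs so that l1*l2 = -cos a and l1*l3 = -cos c.
    if ca > 0.0:
        l2 = -l2
    if cc > 0.0:
        l3 = -l3
    return l1, l2, l3
```

A 3-star that yields `None` can be discarded as a false OTV candidate; otherwise the vectors formed from each image ray and its z component give the 3-D edge directions, with the remaining two-fold sign ambiguity corresponding to the mirror-image interpretation of the corner.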

OTV  with/from  models 

Our analysis begins on both images, processing bottom up on the two
images separately. As the rule system identifies likely OTVs in images
(from its models), it proceeds to match them. The system should
already have a tentative identification of the buildings containing the
OTVs, so there should be relatively few possible matches at this point.
Only OTVs that could be the same point on the same object need be
compared. The analysis results in depths of matched objects, for all
those objects having OTVs.

This  requires  that  the  modelling  system  handle  point  elements,  and 
that  it  include  both: 

•  inferring  OTVs  from  models  (volumes); 

•  accessing  OTVs  stored  explicitly  with  the  models. 

2.4  Rule  Synthesis 

2.4.1 Inference rules.

We continue with the development of inference rules. This work is a
logical extension of previous work [Binford 1981, Lowe 1982] done at
Stanford in developing rules for inferring surface information from a
single view. General assumptions about illumination, object geometry,
the imaging process etc. have been used to derive rules for making
specific inferences. For stereo vision Arnold and Binford [Arnold 1980]
have developed conditions on correspondence of edge and surface
intervals. We divide our rules into two categories: monocular rules, which
enable surface inference from a single view; and stereo rules, which
facilitate cross-image matching.

2.4.2  Monocular  and  stereo  rules. 

1. Monocular rules: Rules which have been developed for
inferences from monocular views can be utilised to provide a
partial 3-dimensional interpretation which directs search in
the second view. This category includes the rule for
interpretation of Orthogonal Trihedral Vertices.

Another example is the T-junction rule [Binford 1981], which
states that "in absence of evidence to the contrary, the stem
of a T is not nearer than the top, i.e. is coincident in space
or further away". Application of this rule gives a set of
nearer/farther relations. A hypothesized correspondence of
edges which leads to inconsistent conclusions from the two
views can be pruned from the search.


An image line which is straight must be the image of a straight
space curve unless the curve is planar and the observer is
coincidentally aligned with the plane of curvature. This enables us
to dismiss correspondences between straight edges in one view
and curves in the other view. If two image curves are
projectively consistent with parallel, we assume they are images of
curves which are parallel in space. That implies that their
images in the other view would be parallel, i.e. parallels map
to parallels.

As these examples illustrate, most of the rules in [Binford 1981,
Lowe 1982] and others developed by Malik and Binford have
as direct corollaries stereo rules for checking the legality of a
match. They can even direct the search process.

2. Stereo rules: These are rules which have been derived from
the stereo imaging process, and are a function of the imaging
geometry.

An example rule in this class, which has long been used for
finding stereo correspondences, is the epipolar rule:
corresponding points must lie on corresponding epipolar lines.
These rules have inherently no monocular analogs. Here are a
few:

a) Horizontal planes in one view get mapped to horizontal
planes in the other view.

b) Use of projective and quasi-projective invariants. This
has not been examined in detail. Duda and Hart [Duda
1973] devote a chapter to this topic which has not
really been exploited in stereo work.

c) Conditions on correspondence of edges and surface
intervals [Arnold 1980].

d)  Surface  Occlusion  rules: 

Surfaces visible in one view can be occluded in the
other view. We are interested in the conditions when
this takes place. The basic idea is that if we cross a
surface, an obscuration of edge occurs. A left surface
visible in a right view is visible in the left view
unless there is obscuration by a tall object. Similarly a
right surface visible in a left view will be seen unless
obscured by a tall object. These surface-obscuration
rules can be formalized by the cross-product rule:


For the hypothesized edge match e1 with f1 and e2
with f2, we compute the z-component of the vector
cross-product in the left image pair and the right
image pair. If the z-components have opposite signs,
we are seeing opposite sides of the surface. That
implies that the object is not opaque.
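
The cross-product rule lends itself to a short check. The Python sketch below is illustrative only: it assumes each edge is given as a 2-D image direction vector (e1, e2 in the left image, f1, f2 in the right image), and the function names are hypothetical.

```python
def cross_z(u, v):
    """z-component of the cross product of two 2-D image vectors."""
    return u[0] * v[1] - u[1] * v[0]

def same_surface_side(e1, e2, f1, f2, eps=1e-12):
    """Return True if the hypothesized match (e1 with f1, e2 with f2)
    sees the same side of the surface in both views.

    A False result means the z-components have opposite signs, so the
    match would require seeing through an opaque object and can be
    pruned. Degenerate (near-parallel) edge pairs give no evidence and
    are accepted.
    """
    left = cross_z(e1, e2)
    right = cross_z(f1, f2)
    if abs(left) < eps or abs(right) < eps:
        return True
    return (left > 0) == (right > 0)
```

A match whose right-image pair is only slightly sheared relative to the left pair passes; a match in which the second edge is flipped to the other side of the first is rejected.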

2.4.3 Use of inference rules in a test analysis

Our  preliminary  results  indicate  good  potential  for  the  success  of  this 
approach.  On  hand  simulations  with  line  drawings  of  stereo  pairs,  the 
rules  helped  narrow  down  the  choices  considerably. 

Consider the imagery shown in figures 2-4 and 2-5. Figure 2-5 is the
right view and 2-4 the left view. Vertices 1, 2, 3 are orthogonal trihedral
vertices. Using the formulae developed earlier, we can find the 3-D
orientations of the edge vectors. These can be matched with the 3-D
orientations of 1', 2', 3' to obtain a registering of these vertices when
combined with the epipolar constraint. All OTVs in one view need not
be visible in the other, i.e. 4'. Of the monocular constraints, the other
major constraints which can be seen here are the T-junction rule and
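and $E_2M_2$ is collinear to this vector. Suppose that in $O_1xyz$, $E_1M_1 : (x_1, y_1, 0)$.
In terms of components in $O_2xyz$:

$$\begin{pmatrix} x'_1 \\ y'_1 \\ z'_1 \end{pmatrix} = \begin{pmatrix} r_{11}x_1 + r_{12}y_1 \\ r_{21}x_1 + r_{22}y_1 \\ r_{31}x_1 + r_{32}y_1 \end{pmatrix}$$

$E_2M_2$ is collinear to:

$$\begin{pmatrix} \nu x'_1 - \lambda z'_1 \\ \nu y'_1 - \mu z'_1 \end{pmatrix} = \begin{pmatrix} x_1(\nu r_{11} - \lambda r_{31}) + y_1(\nu r_{12} - \lambda r_{32}) \\ x_1(\nu r_{21} - \mu r_{31}) + y_1(\nu r_{22} - \mu r_{32}) \end{pmatrix}$$

If we let A be the matrix:

$$A = \begin{pmatrix} \nu r_{11} - \lambda r_{31} & \nu r_{12} - \lambda r_{32} \\ \nu r_{21} - \mu r_{31} & \nu r_{22} - \mu r_{32} \end{pmatrix}$$

then we can write $E_2M_2 = A\,E_1M_1$, where $E_1M_1$ is in base $O_1xy$ and
$E_2M_2$ is in base $O_2xy$.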


b) algorithm

[Figure 3-4: Case 1]

Let N be the number of epipolar lines that we want to determine.
Each epipolar line is uniquely determined by the angle it makes with
the x-axis. Let $\theta$ be this angle. Given k, the epipolar line number,
$0 \le k \le N - 1$, how can we determine $\theta$? If $\theta_0$ and $\theta_1$ are the lower
and upper limits between which $\theta$ is allowed to vary, and $\theta_2 = (\theta_1 - \theta_0)/N$,
then we will choose the middle of each interval: $\theta = \theta_0 + (k + 0.5)\theta_2$. But
what are $\theta_0$ and $\theta_1$? We have to distinguish between three cases:


• The epipole is in the picture (very unlikely). Then $\theta$
can vary between 0 and $2\pi$ radians;

• The epipole is outside the image: there is a minimum
and maximum angle under which the image is seen
from this point. If we choose these angles in $[0, 2\pi]$
then most of the time every $\theta \in [\theta_0, \theta_1]$ will define a
valid epipolar line in $P_1$;

• The exception from above is when the epipole is left
of the image but on the same vertical level: then $[\theta_0, \theta_1]$
cannot be connected and still included in $[0, 2\pi]$. In
this case we will choose the angles in $[-\pi, \pi]$.

Then $L_1$ will be defined by the point $E_1$ and the vector $V_1 = (\cos\theta, \sin\theta)$,
and $L_2$ is defined by $E_2$ and $V_2 = AV_1$.
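
The Case 1 algorithm can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the registration program itself: the interval width is taken to be (theta1 - theta0)/N, the matrix A is supplied as a nested 2x2 list, and the function names are hypothetical.

```python
import math

def epipolar_angles(theta0, theta1, n):
    """Angles of n epipolar lines, one at the middle of each of the n
    equal sub-intervals of [theta0, theta1]."""
    step = (theta1 - theta0) / n          # assumed interval width theta2
    return [theta0 + (k + 0.5) * step for k in range(n)]

def corresponding_direction(A, theta):
    """Direction V2 = A V1 of the epipolar line in the second image,
    where V1 = (cos theta, sin theta) is the direction in the first."""
    v1 = (math.cos(theta), math.sin(theta))
    return (A[0][0] * v1[0] + A[0][1] * v1[1],
            A[1][0] * v1[0] + A[1][1] * v1[1])
```

Each line in the first image is then the pair (E1, V1) and its partner in the second image is (E2, V2), with the epipoles E1 and E2 obtained from the camera solution.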

3.6.2 CASE 2: one epipole E1

a) theory

Given an epipolar line $L_1$, we already know that the corresponding
epipolar line $L_2$ in $P_2$ is collinear to the vector $t = V_2 : (\lambda, \mu)$. Hence we
just need to find a point belonging to $L_2$. Clearly, $L_2$ is the intersection
of $P_2$ with the plane $(E_1, M_1, O_2)$. Thus any line contained in this plane
will intersect $P_2$ at a point contained in $L_2$. In particular, consider the
parallel to $L_1$ drawn through $O_2$. It intersects $P_2$, thus $L_2$, at $M_3$ such
that, in base $O_2xyz$, $O_2M_3$ is collinear to:

$$\begin{pmatrix} r_{11}x_1 + r_{12}y_1 \\ r_{21}x_1 + r_{22}y_1 \\ r_{31}x_1 + r_{32}y_1 \end{pmatrix}$$

hence

$$O_2M_3 = \begin{pmatrix} f_2\,\dfrac{r_{11}x_1 + r_{12}y_1}{r_{31}x_1 + r_{32}y_1} \\ f_2\,\dfrac{r_{21}x_1 + r_{22}y_1}{r_{31}x_1 + r_{32}y_1} \\ f_2 \end{pmatrix}$$

b) algorithm

In the same way as in case 1, we define $L_1 = (E_1, V_1)$ where
$V_1 : (x_1 = \cos\theta, y_1 = \sin\theta)$. Then $L_2$ is defined by $(M_3, V_2)$, where the
coordinates of $M_3$ are

$$\left( f_2\,\frac{r_{11}x_1 + r_{12}y_1}{r_{31}x_1 + r_{32}y_1},\; f_2\,\frac{r_{21}x_1 + r_{22}y_1}{r_{31}x_1 + r_{32}y_1} \right)$$


Case  2 
Figure  3-5 

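3.6.3 CASE 3: one epipole E2

a) theory

Let an epipolar line $L_1$ in $P_1$ be defined by the translation vector $t = V_1$
and by a point $M_1$ we pick. In $P_2$, $L_2$ goes through $E_2$ and is collinear
to a vector $E_2M_2$, intersection of $P_2$ with the plane $(O_2, E_2, M_1)$. This
plane is orthogonal to $O_2M_1 \times t$ and $P_2$ is orthogonal to $O_2z$. Hence
the intersection is collinear to $O_2z \times (O_2M_1 \times t)$.

Since

$$O_2M_1 = O_2O_1 + O_1M_1 = kt + O_1M_1,$$

$$O_2M_1 \times t = O_1M_1 \times t,$$

and

$$O_2z \times (O_2M_1 \times t) = (O_2z \cdot t)\,O_1M_1 - (O_2z \cdot O_1M_1)\,t.$$

Suppose $O_1M_1 : (x_1, y_1, f_1)$ in $O_1xyz$. Then in $O_2xyz$:

$$\begin{pmatrix} x'_1 \\ y'_1 \\ z'_1 \end{pmatrix} = \begin{pmatrix} r_{11}x_1 + r_{12}y_1 + r_{13}f_1 \\ r_{21}x_1 + r_{22}y_1 + r_{23}f_1 \\ r_{31}x_1 + r_{32}y_1 + r_{33}f_1 \end{pmatrix}$$

$E_2M_2$ is thus collinear to:

$$\begin{pmatrix} \nu x'_1 - \lambda z'_1 \\ \nu y'_1 - \mu z'_1 \end{pmatrix} = \begin{pmatrix} x_1(\nu r_{11} - \lambda r_{31}) + y_1(\nu r_{12} - \lambda r_{32}) + f_1(\nu r_{13} - \lambda r_{33}) \\ x_1(\nu r_{21} - \mu r_{31}) + y_1(\nu r_{22} - \mu r_{32}) + f_1(\nu r_{23} - \mu r_{33}) \end{pmatrix}$$

Hence, if we let A be the same matrix as in CASE 1 and define the
offset matrix:

$$\begin{pmatrix} \nu r_{13} - \lambda r_{33} \\ \nu r_{23} - \mu r_{33} \end{pmatrix}$$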

4: Automated Stereo Mapping

4.1 Background


Results from our laboratory over the past few years [Quam 1971,
Hannah 1974, Moravec 1980, Gennery 1980, Arnold 1980, Baker 1981,
Arnold 1983] have demonstrated the possibilities of both area-based
and feature-based stereo matching.

Area-based stereo matching uses windowing mechanisms to isolate
parts of two images for cross-correlation. Feature-based stereo
matching uses two-dimensional convolution operators (and perhaps grouping
operators) to reduce an image to a depiction of its intensity
boundaries, which can then be put into correspondence. Area-based
cross-correlation techniques require distinctive texture within the area of
correlation for successful operation. In general, correlation breaks track where
there is no local correlation (zero signal, or where two images do not
correspond, i.e. occlusions) or where the correlation is ambiguous (where
the signal is repetitive).
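
The two failure modes named above can be reproduced with a small one-dimensional sketch in Python. Normalized cross-correlation is used here as a representative area-based measure; this is an assumption for illustration, not a description of any of the cited systems, and the function names are hypothetical.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length windows.
    Returns None when either window has no variation (zero signal)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    if denom == 0:
        return None
    return sum(x * y for x, y in zip(da, db)) / denom

def best_offsets(window, line, tol=1e-9):
    """All offsets along `line` at which `window` correlates best.
    An empty list means no usable correlation; more than one offset
    means the match is ambiguous."""
    w = len(window)
    scores = [ncc(window, line[s:s + w]) for s in range(len(line) - w + 1)]
    valid = [x for x in scores if x is not None]
    if not valid:
        return []
    best = max(valid)
    return [s for s, x in enumerate(scores)
            if x is not None and best - x <= tol]
```

A distinctive window yields a single best offset, a flat window yields none, and a periodic signal yields several equally good offsets.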

Demands of mapping in cultural sites and in locales with surface
discontinuity and ambiguous or non-existent texture make it essential that
if area-based analysis is to be done, it be done in conjunction with
feature-based analysis. Feature-based analysis provides a solution to
many of the problems of correlation. Principal among its advantages
is that it operates on the most discriminable parts of an image: places
that are distinctive in their intensity variation, and where localization
is greatest. These are typically the boundaries between objects or
between details on objects, or between objects and their backgrounds.

The important point is that the features being put into correspondence
for depth estimates are the boundaries of objects: area-based analysis
is at its worst at object boundaries, yet determining boundaries can be
said to be the most important part of mapping in 3-space.

The [Baker 1981] system is the only current system that mixes these two
matching modalities. We have been working at applying this system to
some cultural scenes. Before carrying out these analyses we wished to:

a) enhance the system with a capability to work with a better
edge operator [Marimont 1982];

b) enable it to process images that are not graced with collinear
epipolar geometry (i.e. most images);

c) introduce an additional correspondence measure: edge extent.

4.2  Epipolar  Registration 

To implement these enhancements required substantial redesign of the
system, and redesign cycles with the Marimont process. Choosing
usable data also presented difficulties, as the only imagery available was
not of the correct geometry (see below). The two image pairs initially
chosen (the Sacramento apartment complex and a section of some
imagery of Moffett Field) proved, on closer examination, to require quite
complex transformation, and could not be easily adjusted for epipolar
processing.


In general, bringing imagery data into a properly transformed state
could proceed in one of two ways:

• one could determine the transforms and then modify
the imagery, producing an image pair having collinear
epipolar geometry;

or


• one could determine the transforms, and modify the
output of an edge operator process that functions over
the original imagery.


The latter is by far the superior approach, as it avoids resampling
the image. This approach necessitates incorporating the transform
computation into the stereo system, to follow edge finding and precede
edge matching.


The second part of the stereo system's analysis is an intensity
correlation process. This operates along epipolar lines as well, and clearly
requires intensity information to be accessible along epipolar lines. One
solution to this would be to take the original image pair and have the
correlator rotate and change shape, size, and orientation as it moves
around the image; this is an awkward and probably unnecessary
complication. An alternative would be to access the transformed images,
sampled as accurately as possible, and do the correlation in the
rectangular space defined by collinear epipolar lines. The argument from edge
accuracy indicated that transforming edges rather than resampling the
image was the way to go; this argument from intensity correlation
suggests that the resampled image can be useful.


Another implementation detail supported this use of both transformed
edges and transformed imagery: it was found that the intensity
information available from the Marimont process had too small a basis
for useful correlation, and in fact, for transformed edges, had little
relevance for the matching (it being measured not along epipolar lines
but normal to the edge direction). The transformed image had to be
referenced again by the system to obtain more significant intensity
estimates oriented along epipolar lines, and working with the image in
epipolar space facilitated this.


The philosophy of the stereo system is to use
edge analysis for, among other things, its higher accuracy, and to use
intensity analysis for the continuity it provides. To be consistent with
this, we wanted to have the highest possible accuracy for edges in
epipolar space, and if sacrifice be needed for simplicity, to do it where
it least degraded the analysis: in the intensity correlation. It is clear
that transformed edges give higher accuracy than edges from
transformed images (detectability might not change much, but localization is
significantly reduced); and important simplifications could be obtained
for little loss by doing the intensity correlation over the resampled
image pair. This meant changes in our plans for the registration
system: it had to produce not just transform information, but transformed
images as well. Both forms are made available as output from the
registration program described in section 3, and the enhanced Baker
system uses them both.


4.3  The  Marimont  Edge  Operator 

The Marimont edge operator has greater detection and reliability than
the original Baker edge operator, and similar localization; earlier
examples of its processing convinced us that its output would improve
the quality of our stereo reconstruction. Its ability to track along zero
signal areas in following zero-crossing edges leads to more coherent
image descriptions. [Marimont 1982] provides details of the operator's
functioning. Roughly, it works by convolving an m x m lateral
inhibition function of n x n central window with an image. Zero crossings
in this resultant image then indicate edges, and the edge position is
determined by interpolating over the lateral inhibition surface.
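
The general idea of a lateral inhibition response followed by interpolated zero crossings can be shown in one dimension. The Python sketch below is a simplified stand-in, not the Marimont operator: it assumes a centre-minus-surround response computed as the mean of an n-sample central window minus the mean of the full m-sample window, and linear interpolation between neighbouring samples. All names are hypothetical.

```python
def lateral_inhibition(signal, n=1, m=5):
    """Centre-minus-surround response along a 1-D signal.
    Returns (index, response) pairs for positions where the full
    m-sample window fits inside the signal."""
    half_n, half_m = n // 2, m // 2
    out = []
    for i in range(half_m, len(signal) - half_m):
        centre = signal[i - half_n:i + half_n + 1]
        full = signal[i - half_m:i + half_m + 1]
        out.append((i, sum(centre) / len(centre) - sum(full) / len(full)))
    return out

def zero_crossings(response):
    """Sub-sample edge positions where the response changes sign,
    located by linear interpolation between adjacent samples."""
    edges = []
    for (i0, r0), (i1, r1) in zip(response, response[1:]):
        if r0 * r1 < 0:
            edges.append(i0 + r0 / (r0 - r1))
    return edges
```

For a step from 0 to 10 between samples 5 and 6, the response is -4 at sample 5 and +4 at sample 6, and the interpolated edge position is 5.5.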

A  few  unanticipated  problems  became  apparent  once  work  with  the 
edges  was  begun.  One  point,  noted  above,  was  that  the  intensity 
information  stored  at  an  edge  (its  left  and  right  boundary  values)  had 
quite  small  support  (a  single  pixel).  This  is  in  contrast  with  the  original 
operator  which  interpolated  for  these  values  in  an  area  3  pixels  wide  and 
removed  one  pixel  from  the  determined  edge  position.  Another  problem 
was  that  the  edge  connectivity  produced  by  the  Marimont  system  can 
be misleading. Intensity significance was improved by sampling along
epipolar lines in the transformed images. The connectivity problem
has not been looked at yet. Good connectivity is inherently difficult to
achieve with zero crossing operators. Refinements to the process are
being considered.

4.4  Edge  Extent 

The introduction of edge extent as a parameter in the dynamic
programming solution was an obvious fallout from using the Marimont
edges. Edges are output by that process as strings, with
2-connectedness. The maximum and minimum of a string in transform
space give a measure of its (epipolar) extent. Prior to the use of this
information the only way that global continuity entered the analysis
was through a consistency enforcement relaxation process which
ensured that edges connected in one view were interpreted as continuous
in 3-space; all matching measures were quite local. With the modified
approach, the correspondence measure is a function of (among other
more statistically based parameters) the ratio of edge extents. In
particular, the likelihood of edge element a in the left image matching edge
element b in the right image depends on the product of the ratios of
the two upper extents (up from the edge elements) and the two lower
extents (down from the two edge elements).
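
A minimal Python sketch of such an extent term follows. The text states only that the likelihood depends on the product of the two ratios; taking each ratio as shorter over longer, so that it lies between 0 and 1, is an assumption made here for illustration, as are the function names.

```python
def extent_ratio(x, y):
    """Ratio of the shorter to the longer of two non-negative extents.
    Two zero-length extents are treated as a perfect agreement."""
    longer = max(x, y)
    if longer == 0:
        return 1.0
    return min(x, y) / longer

def extent_score(up_a, down_a, up_b, down_b):
    """Extent term for matching left edge element a with right edge
    element b: product of the upper-extent and lower-extent ratios."""
    return extent_ratio(up_a, up_b) * extent_ratio(down_a, down_b)
```

In a full system this term would be combined with the statistically based parameters mentioned above; on its own it simply favours pairing edge elements that sit at similar positions along strings of similar length.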


4.5  Testing  on  Imagery 

When image testing began with all of the above accomplished, another
problem became apparent: the stereo system, bound into a machine
architecture with a maximum of 256K words of memory, and always
tightly wedged anyway, had grown with these changes to the point that
only small portions of images could be worked on at once. Thus came
to exist a windowing mechanism within the edge finding/loading and
stereo matching processes.

Our testing has been progressing on several sets of imagery: a
synthetic image pair from Control Data Corporation, an aerial scene
from the Engineering Topographic Laboratory, and a building scene
of Sacramento. We will report on the results of these analyses at a
later date.

5:  Summary 

A  principal  research  interest  of  our  group  is  in  developing  a  rule- 
based  advanced  automated  stereo  mapping  system  to  function  within 
ACRONYM  [Brooks  1981].  Current  mapping  techniques  ignore  much 
of  the  information  available  from  inference  on  single  views  of  a  scene. 
This  information  can  be  useful  for  three-dimensional  surface  interpreta¬ 
tion,  and  also  provides  extra  parameters  for  stereo  matching  (i.e.  sur¬ 
face  orientation,  occlusion  cues).  Our  research  effort  is  directed  at 
establishing  such  monocular  inference  rules  in  a  rule-base  for  stereo 
mapping. 

In deriving these rules, we perform analysis of both hand extracted
and automatically produced edge descriptions. A facility has been
developed for this manual edge extraction from hardcopy imagery.
We have studied rule synthesis for several cases, including that of
orthogonal trihedral vertices, features that dominate cultural scenes.
This research is very promising, and has shown the utility of the
rule-based approach to surface inference from monocular information.

Camera solving provides powerful constraint on the correspondence
problem in stereo matching. We have developed a facility for
interactively registering images, determining the parameters for transforming
them (or their edge descriptions) into collinear epipolar space, and
performing the actual image transformation. This determination is crucial
to a mapping process. Incorporating an automated module to provide
data for the camera solving is a very important next step.

We have experimented with an existing stereo mapping process,
enhancing its flexibility with respect to image format and with respect
to edge operator format, and have been preparing example outputs of
its processing on new imagery. Our intent with this effort has been
to show the capabilities of a local matching process and to assess its
applicability to the planned rule-based system.
A design is presented for a Content Addressable
Array Parallel Processor (CAAPP) which is both
practical and feasible. Its practicality stems
from an extensive program of research into real
applications of content addressability and
parallelism. The feasibility of the design stems
from development under a set of conservative
engineering constraints tied to limitations of VLSI
technology. We then describe the implementation of
two procedures for image processing on the CAAPP.
The first performs image convolutions very quickly.
It is shown that this algorithm can be generalized
to perform convolutions with increased mask size
with only a moderate reduction in speed. The
second uses the CAAPP to quickly and robustly
decompose an optic flow field into its rotational
and translational components to recover sensor
motion parameters. It is important to note that
this latter is made possible only by the
combination of associativity and array processing
that our design provides.


1.0 DESIGN OF THE CAAPP


We  have  developed  a  design  for  a  VLSI-based  Content 
Addressable  Array  Parallel  Processor  (or  CAAPP)  for 
image  processing  and  other  applications.  Our 
intention  has  been  to  produce  a  feasible  design 
which  would  be  simple  enough  for  us  to  construct 
with  reasonable  confidence  of  success  but  which 
would  also  provide  a  significant  advance  in 
processing  power.  Accordingly,  a  number  of 
constraints have been imposed on the design. The
CAAPP would have to consist of no more than one
hundred circuit boards and each board should have a
maximum of one hundred off-board connections.
Additionally, the VLSI chips we were to design
would be restricted to existing feature and die
sizes, have a pin-out of no more than forty pins,
and a power dissipation of less than two watts.


We also set a number of goals which we hoped to
achieve. It was decided that the CAAPP should
contain 262,144 cells arranged as a rectangular
512x512 array to facilitate image processing. Each
cell would contain at least thirty-two bits of
memory (preferably sixty-four bits), multiple tags,
and some bit serial processing power. One hundred
nanoseconds was set as a goal for the minor clock
cycle time. Additionally, for image processing
applications, we needed to be able to load the
memory with an image in one video frametime (1/30
second). For sixteen-bit pixels this means a data
transfer rate of about sixteen million bytes per
second. Finally, it was decided that the CAAPP
would be built to be driven by another machine,
such as a Digital Equipment Corporation VAX. Once
the goals and constraints were set, work on the
design got under way.
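
The transfer-rate figure follows directly from the array size, pixel depth and frame rate quoted above. A short Python check (the function name is hypothetical):

```python
def image_load_rate(rows, cols, bits_per_pixel, frames_per_second):
    """Bytes per second needed to load one full image per frame time."""
    bytes_per_frame = rows * cols * bits_per_pixel // 8
    return bytes_per_frame * frames_per_second

# 512 x 512 cells, sixteen-bit pixels, one image per 1/30 second
rate = image_load_rate(512, 512, 16, 30)
```

Here 512 x 512 = 262,144 pixels at two bytes each is 524,288 bytes per frame, and thirty frames per second gives 15,728,640 bytes per second, which is the "about sixteen million" stated.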


1.1 The CAAPP and Its Environment


The CAAPP is divided into two main parts: the
central control and the parallel processor (see
figure 1). The central control is a simple, fast,
fetch-ahead pipelined processor which will be built
from MSI devices. It issues instructions to the
parallel processor, controls loading and unloading
of data in the parallel processor, and serves as an
interface to the VAX or other host computer and to
other data sources and secondary storage devices.
The  central  controller  contains  a  ROM  with  a  set  of 
micro-coded  subroutines  for  performing  commonly 
needed  higher  level  CAAPP  operations,  and  a 
writeable  control  store  which  allows  users  to  add 
their  own  special  microcoded  instructions.  Also 
contained  in  the  central  controller  is  a  small 
program  memory  into  which  subroutines  or  entire 
programs  may  be  loaded.  The  writeable  control 
store  and  program  memory  are  loaded  directly  by  the 
VAX.  Once  these  memories  are  loaded,  the  VAX  can 
issue  commands  to  the  central  controller  which  tell 
it  to  execute  routines  stored  in  the  program 
memory,  to  single  step  through  a  stored  routine,  or 
to  execute  a  single  instruction  passed  as  a  literal 
by  the  VAX. 


This work was supported in part by grants from the
Army Research Office DAAG29-79-G-0046 and grant
number N00014-82-K-0464 from DARPA.
Figure 1


1.1.1 The Parallel Processor


The parallel processor design consists of an 8x8
array of processing circuit boards and a set of
special purpose boards which control how the edges
of the CAAPP are treated. The parallel processor
receives data and instructions broadcast to it by
the central controller. Each parallel processor
instruction operates in exactly one minor cycle
time. Some operations do require multiple clock
cycles, but these are taken care of by having the
central control rebroadcast the instruction as many
times as necessary.

Each processor board consists of an 8x8 array of
special CAAPP integrated circuits plus some random
buffer logic. Our current design calls for all
sixty-four processor circuit boards to be placed in
four card racks (sixteen per rack) and
interconnected by a broadcast backplane and ribbon
cables.

The heart of the CAAPP design is a special purpose
nMOS VLSI integrated circuit. Each of these chips
will contain sixty-four CAAPP cells, an instruction
decoder, and other miscellaneous logic. We have
designed the chip with as much generality as
possible, knowing that such generality need not be
fully used later on.


1.1.2 The Communications Interconnect

One  of  the  biggest  problems  in  designing  the  CAAPP 
was  how  to  handle  the  rectangular  interconnection 
of  the  cells.  The  number  of  wires  required  for 
such  a  network,  even  for  bit  serial  communications, 
is  staggering.  This  became  most  evident  when  we 
tried  to  design  the  IC  communications  interface. 
For sixty-four cells, the arrangement which gives
the minimum number of external connections is an
8x8 grid. With a four-way N,S,E,W interconnect
there are then only thirty-two neighboring cells to
connect to. (We considered an eight-way
N,S,E,W,NW,NE,SW,SE interconnect, but were forced
to abandon it due to the wiring complexity.) By the
time control, power, and clock signals were added
to the thirty-two neighbor lines, we found that a
sixty-four pin package would be required to hold
the IC. Further examination also revealed that
full interconnection would require that each
processor board have 256 ribbon cable communication
lines; in other words, a two foot wide swath of
ribbon cable running between each pair of boards!
Because this violated two of our main design
constraints, we had to simplify the
interconnections.

By 8:1 multiplexing the communications net as it
crossed chip boundaries, we were able to reduce the
IC pin count to twenty-two pins and the I/O lines
per board to sixty-three (of which only thirty-two
need to be run in ribbon cable, the rest being
backplane bus signals). By going from sixty-four
pin  to  twenty-two  pin  packages,  the  board  size  was 
also  reduced  significantly.  Unfortunately,  all  of 
these  benefits  were  paid  for  in  a  loss  of  speed. 
The  new  interconnect  takes  0.8  microseconds  to 
transfer  one  bit  between  cells  (25.6  microseconds 
for  thirty-two  bits).  We  should  also  note  here 
that  the  CAAPP  instruction  set  makes  this 
multiplexing  transparent  to  the  user. 
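
The line counts and timings in this section can be checked with a few lines of arithmetic. The Python sketch below uses only figures quoted in the text (8x8 cells per chip, 8x8 chips per board, 8:1 multiplexing, a 100 ns minor cycle). That the 0.8 microsecond bit time is eight minor cycles is an inference from those figures, and the names are hypothetical.

```python
def edge_links(cells_per_side):
    """Neighbor links leaving a square block of cells with a four-way
    N, S, E, W interconnect: one per cell on each of the four sides."""
    return 4 * cells_per_side

MUX = 8                 # 8:1 multiplexing across chip boundaries
MINOR_CYCLE_NS = 100    # design goal for the minor clock cycle

chip_links = edge_links(8)              # 8x8 cells per chip
board_links = edge_links(8 * 8)         # 8x8 chips of 8x8 cells per board

chip_lines_muxed = chip_links // MUX    # neighbor lines per chip after multiplexing
board_lines_muxed = board_links // MUX  # ribbon cable lines per board

bit_time_ns = MUX * MINOR_CYCLE_NS      # time to move one bit between cells
word_time_ns = 32 * bit_time_ns         # time to move thirty-two bits
```

These reproduce the thirty-two neighbor lines per chip, the 256 lines per board before multiplexing, the thirty-two ribbon cable lines after it, and the 0.8 and 25.6 microsecond transfer times.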


1.1.3 Some/None Logic


A common means of controlling loop execution in CAM
algorithms is to continue processing until none or
only one of the CAM's tag bits are turned on. It
is thus essential that we have a fast means of
determining this. The simplest way of doing this
is to test whether any tags are on; if so, we find
the first one and turn it off, then repeat the
some/none test. The "find-first" operation is also
used frequently when a search selects several data
elements with the same key value. Find-first then
provides a way to select one of these for
processing. Thus the need is great for fast
some/none and find-first operations. On-chip the
Some/None signal is determined by feeding the
output of the main tag bit into a sixty-four-way
NOR with an inverter between its output and the
Some/None pad driver. Once the signal goes
off-chip, it passes through a four-level OR tree
before reaching the central controller.
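
The control loop described above can be simulated in software. The Python sketch below models the tag bits as a list of booleans. It illustrates the loop structure only and says nothing about the hardware timing; the function names are hypothetical.

```python
def some(tags):
    """Some/None test: a NOR over all tag bits followed by an inverter."""
    nor = not any(tags)
    return not nor

def find_first(tags):
    """Index of the first tag bit that is on (caller checks some() first)."""
    return tags.index(True)

def process_responders(tags):
    """Visit every responding cell in turn: test some/none, select the
    first responder, turn its tag off, and repeat until none remain.
    Returns the indices in the order they were selected."""
    tags = list(tags)          # work on a copy of the tag bits
    order = []
    while some(tags):
        i = find_first(tags)
        order.append(i)        # the selected cell would be processed here
        tags[i] = False
    return order
```

With tags set on cells 1, 3 and 4, the loop selects them in that order and then terminates on the next some/none test.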


1.1.4 Count Responders


Many CAM designs devote much complex and expensive
hardware to increasing the speed of the operation
which counts the tag bits that are turned on. We
have found, however, that the count of responding
tags is used primarily as a way of gathering
statistics for use at much higher levels of
processing control to direct the strategic
application of the CAM. It is thus rather
infrequently applied as compared to operations such
as comparisons and some/none tests. We thus feel
that slower, simpler, less expensive response count
hardware is quite acceptable. Further, we have
designed a very simple response count system which
runs only about an order of magnitude slower than
much more complex designs.

The count responders operation requires only three
changes to be made to the CAAPP circuitry to be
feasible. First, it must be possible to connect
all of the response bits on a chip into a circular
shift register. (This is also useful as a means of
testing the integrated circuits since it allows us
to directly examine register contents with a test
circuit.) This shift register is easily added
because the neighbor communication network already
provides most of the necessary links. Secondly, a
register, a counter, and a full adder must be added
to each chip. Finally, the cards that control the
top-bottom edge treatment must be modified to
include column summing registers and a final sum
register. The algorithm used is reasonably fast
(about twenty-six microseconds), inexpensive, and
most importantly it can be used with any size of
array without having to modify the basic IC; only
the bottom row circuit board needs to be changed.


1.2 Device Floorplan

The unit cells are arranged in two columns of
thirty-two (see Figure 2). This arrangement was
chosen  because  we  found  that  the  best  compaction 
would  be  obtained  if  we  could  share  control  and 
memory  select  lines  among  as  many  cells  as 
possible.  Each  cell  is  thus  very  long  and  narrow. 
A  column  of  thirty-two  cells  is  almost  covered  by  a 
river  of  metal  control  and  select  lines  which  run 
vertically  over  it.  These  lines  are  simply 
duplicated  and  mirrored  for  the  two  columns. 
Control  is  generated  by  a  NOR-NOR  network  forming 
the  instruction  and  address  decoders.  Responder 
count  hardware  is  provided  in  a  small  block  of 
random  logic.  The  overall  size  estimate  of  the 
active  chip  area  (excluding  pads  and  drivers)  is 
2400x2400 lambda. Thus if lambda is three microns,
the central portion of the die would be roughly 285
mils on a side. This is somewhat large, but not
unreasonable.  Access  to  a  fabrication  facility 
with  slightly  smaller  feature  size  and/or  two 
layers  of  metal  would  significantly  reduce  the  die 
size.  Power  dissipation  is  estimated  at  1.5  watts, 
which  is  low  enough  to  allow  forced-air  cooling. 
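
The die-size estimate is a unit conversion, sketched below.
The function name is hypothetical; the only outside fact
used is that one mil is 25.4 microns.

```python
MICRONS_PER_MIL = 25.4  # one mil is a thousandth of an inch


def side_in_mils(side_in_lambda, lambda_in_microns):
    """Convert a die edge given in lambda units to mils."""
    return side_in_lambda * lambda_in_microns / MICRONS_PER_MIL


# 2400 lambda at three microns per lambda is 7200 microns.
active_side = side_in_mils(2400, 3.0)
```

With these inputs the active area comes to about 283 mils
on a side, in line with the rough figure quoted above. A
smaller lambda shrinks the edge in direct proportion.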

1.3  The  Unit  Cells 


A  unit  cell  consists  of  thirty-two  bits  (which  will 
be  expanded  to  at  least  64  bits  in  the  final 
design)  of  fully  static  memory,  four  one-bit  static 
tag  "registers"  called  A,  B,  X,  and  Y,  and  a  static 
carry  bit  "register"  called  Z.  Each  cell  also 
contains  an  ALU  which  continuously  generates  X  nand 
Y,  X  nor  Y,  and  X  +  Y  +  Z.  Finally,  each  cell 
contains  logic  for  selecting  one  source  of  data  (a 
register,  memory,  an  ALU  function,  broadcast  data, 
or  a  neighbor  cell),  possibly  inverting  the 
selected  signal  and  storing  it  in  a  destination 
(memory  or  register).  Neighbor  communication  lines 
run  vertically  in  two  channels  in  the  middle  of  the 
cell.  The  Z  register  is  special  in  that  it  is  not 
available  for  selection  as  a  data  source.  It  can 
be  copied  directly  to  the  X  register  and  can  be 
loaded  from  the  output  of  the  selector.  It  also  is 
loaded  with  the  carry  from  X  +  Y  +  Z  whenever  that 
function  is  selected. 

The  X  register  is  special  in  that  its  output  is 
connected  to  the  some/none  logic  and  the  neighbor 
communication  network.  In  some  sense  it  is  the 
"main"  tag  bit. 

The  A  register  is  also  special.  It  controls 
whether  the  cell  is  active.  If  a  cell  is  not 
active,  it  ignores  all  instructions  broadcast  by 
the  central  controller  except  a  special  few. 

The  Y  register  is  intended  to  be  used  for  storing  a 
second  set  of  tag  bits  which  may  eventually  be 
combined  with  other  sets  through  the  logical 
operations provided by the ALU.

Figure 3


The  B  register  is  intended  as  temporary  storage  for 
a  second  set  of  activity  bits,  essentially 
providing  a  single  level  of  "subroutine  call"  or  an 
alternative  activity  "screen". 

Figure 3 shows the logical arrangement of a unit
cell.
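
To make the register roles concrete, the following is a
minimal software model of one unit cell. It is a sketch
under assumptions, not the circuit: the class and helper
names are hypothetical, fields are stored least significant
bit first, and the bit-serial addition shown is one
plausible use of the X + Y + Z function with the carry held
in Z.

```python
class UnitCell:
    """One cell: bit memory, tag registers A, B, X, Y and carry register Z."""

    def __init__(self, memory_bits=32):
        self.memory = [0] * memory_bits
        self.A = 1  # activity bit
        self.B = 0  # spare activity bit
        self.X = 0  # main tag bit
        self.Y = 0  # second tag bit
        self.Z = 0  # carry bit

    def nand(self):
        return 1 - (self.X & self.Y)

    def nor(self):
        return 1 - (self.X | self.Y)

    def add(self):
        """Select X + Y + Z: return the sum bit and load the carry into Z."""
        total = self.X + self.Y + self.Z
        self.Z = total >> 1
        return total & 1


def store_field(cell, base, width, value):
    """Write an integer into memory bits base..base+width-1, LSB first."""
    for k in range(width):
        cell.memory[base + k] = (value >> k) & 1


def load_field(cell, base, width):
    """Read an integer back from a memory field stored LSB first."""
    return sum(cell.memory[base + k] << k for k in range(width))


def bit_serial_add(cell, a_base, b_base, dest_base, width):
    """Add two memory fields one bit per step, carrying through Z."""
    cell.Z = 0
    for k in range(width):
        cell.X = cell.memory[a_base + k]
        cell.Y = cell.memory[b_base + k]
        cell.memory[dest_base + k] = cell.add()
```

Storing 13 and 7 in two eight-bit fields and calling
bit_serial_add leaves 20 in the destination field. The sum
wraps modulo 2**width, with the final carry left in Z.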


1.4  CAAPP  IC  Instruction  Set 


Each  CAAPP  IC  instruction  executes  in  one  minor 
clock  cycle  (100  ns).  This  was  done  to  avoid 
feedback  loops  in  the  decoder  on  the  chip  and  to 
avoid  special  instruction  states  in  the  central 
controller.  This  means  that  the  central  controller 
must  be  programmed  to  re-issue  some  instructions 
several  times.  For  example,  transferring  data  to 
neighbor  cells  across  chip  boundaries  requires 
eight  individual  transfers  because  of  the  8:1 
multiplex. The central controller must therefore
issue the shift instruction eight times in a row.
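
The cost of a cross-chip transfer follows directly from the
cycle time and the multiplex ratio. The short sketch below
(function and constant names are hypothetical) reproduces
the interconnect figures of 0.8 microseconds per bit and
25.6 microseconds for thirty-two bits.

```python
MINOR_CYCLE_NS = 100   # one instruction per minor clock cycle
MULTIPLEX_RATIO = 8    # 8:1 multiplex across chip boundaries


def cross_chip_transfer_us(bits, multiplex=MULTIPLEX_RATIO,
                           cycle_ns=MINOR_CYCLE_NS):
    """Time in microseconds to move `bits` bits to a cell on another chip.

    Each bit needs `multiplex` shift instructions of `cycle_ns` each.
    """
    return bits * multiplex * cycle_ns / 1000.0


one_bit = cross_chip_transfer_us(1)
one_word = cross_chip_transfer_us(32)
```

Halving the multiplex ratio in this model halves the
transfer time.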

There  are  eight  basic  instructions  recognized  by 
the  chip.  Of  these,  six  are  memory  transfer 
operations  and  use  a  five-bit  address  value  (this 
will  be  expanded  to  at  least  six  with  the  increase 
in  memory  size)  to  select  the  bit  to  be  read  or 
written.  The  other  two  instructions  treat  the 
address  as  a  sub-operation  specifier.  For  the  most 
part  these  are  non-memory-data-source  to  register 
transfer  operations  with  one  op  code  causing  the 
data  to  be  inverted  before  storage  and  the  other 
causing  a  direct  transfer.  There  are  nineteen 
special  sub-ops,  however,  which  are  reserved  for 
unusual  operations  such  as  transferring  data  on  and 
off  the  chip  or  counting  responders. 


Some  operations  are  also  designated  as  "jam 
transfers".  This  means  that  they  are  performed 
regardless  of  whether  the  A  register  contains  a 
logic  one.  These  provide  a  means  of  storing  and 
retrieving  different  activity  patterns  and  of 
applying  global  operations  which  ignore  activity 
without the usual overhead of having to save the
current activity pattern and retrieve it later.
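
The difference between an ordinary transfer and a jam
transfer can be illustrated with a small model. The
dictionary representation and the function names are
hypothetical. The sketch shows only the gating by the A
register, not the actual instruction encoding.

```python
def make_cells(activity, x_bits):
    """Build cells with the given A (activity) and X bits; B and Y start at 0."""
    return [{"A": a, "B": 0, "X": x, "Y": 0}
            for a, x in zip(activity, x_bits)]


def transfer(cells, source, destination, invert=False, jam=False):
    """Copy one register to another in every cell that obeys the instruction.

    Ordinary transfers act only where A is 1; jam transfers act everywhere.
    """
    for cell in cells:
        if jam or cell["A"] == 1:
            bit = cell[source]
            cell[destination] = 1 - bit if invert else bit


def column(cells, register):
    """Collect one register across all cells for inspection."""
    return [cell[register] for cell in cells]
```

With activity 1 0 1 0 and X bits 1 1 0 0, an ordinary
inverted transfer from X to Y changes only the two active
cells, while the same transfer issued as a jam transfer
writes all four. Saving A into B, loading a new pattern
into A, and later copying B back to A are all jam
transfers in this model.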


1.5  Current  Status 


As  of  this  writing  we  have  designed  a  sixteen-cell 
(4x4)  test  chip.  Using  a  simple  set  of  three 
micron  design  rules,  we  have  succeeded  in  fitting 
the  circuitry  onto  a  180x180  mil  body  area  with 
room  to  spare.  The  actual  cell  area  occupies  only 
110x150  mils.  Estimated  power  dissipation  is  only 
350  milliwatts.  The  design  includes  about  7000 
transistors. The memory portion of the chip is
being fabricated as part of a student project by
MOSIS. We are working closely with the General
Electric Corporation on a feasibility study examining
the technologies they could bring to bear should we
embark on a cooperative development project.


We  have  already  written  a  number  of  programs  for 
the  CAAPP  and  estimated  their  operation  times  by 
hand.  For  example,  one  special  purpose  convolution 
of  interest  in  computer  vision  processing  (a  simple 
3x3  mask)  required  only  100  microseconds  for  the 
entire  512x512  image.  (This  will  be  presented  in 
the next section.) More complex convolutions take
longer, of course, but most of interest can be
performed in less than five milliseconds. We have
also examined motion analysis and found the results
to be quite encouraging (some of which will be
discussed later). An instruction level simulator
has been programmed for the CAAPP and is currently
being  used  to  develop  and  test  small  applications, 
and  to  gather  statistics  on  instruction  set  usage 
and  predict  execution  times.  (It  is,  of  course, 
nearly  impossible  to  test  any  really  large 
applications  on  the  simulator  because  most  of  these 
would  require  several  days  of  computer  time.)  A 
higher  level  simulator  is  also  being  integrated 
into  our  department's  computer  vision  research 
system  so  that  researchers  can  begin  to  develop 
vision  related  algorithms  for  the  CAAPP. 


2.0  IMAGE  CONVOLUTION  ON  THE  CAAPP 


Our  previous  work  on  Conway's  Game  of  Life 
implemented  on  a  CAM  [2]  demonstrated  that  such  a 
device  could  be  effectively  used  to  speed  up 
algorithms  which  dealt  with  rectangular  grids  of 
cells  and  small  neighborhoods  about  each  of  those 
cells.  Conway's  Game  of  Life  actually  involves 
performing  a  very  simple  image  convolution  and  the 
technique  developed  for  it  was  applied  to  more 
general  convolutions.  This  method  was  further 
refined  with  the  CAAPP  design. 


2.1 Basic Technique


One  simple  implementation  of  convolution  involves 
each  cell  on  a  rectangular  grid  examining  its 
immediate  neighborhood  and  then  updating  its  own 
contents  based  upon  some  function  of  that 
neighborhood.  The  update  must,  of  course,  be 
performed  after  all  cells  have  finished  examining 
their  neighborhoods.  On  a  parallel  array  processor 
this  examination  can  be  performed  simultaneously  by 
all  of  the  cells  on  the  grid,  as  can  the  update 
operation.  Thus  the  algorithm  for  the  convolution 
can  be  described  as  the  actions  of  a  single  cell 
with  the  understanding  that  each  action  is 
performed  simultaneously  by  all  of  the  cells. 

There  are  two  different  ways  of  approaching  the 
problem  of  examining  the  neighborhood.  The  one 
that  first  comes  to  mind  is  that  each  cell  "looks" 
at  each  cell  in  its  neighborhood,  gathering  what 
information  it  needs  to  perform  an  update.  In 
practice  this  involves  moving  data  from  each  cell 
in  the  neighborhood  into  the  "central"  cell  where 
some  function  is  then  applied  to  it  and  the  result 
stored for the update phase of the convolution.
The  problem  with  this  is  that  the  data  must  often 
pass  through  other  cells  before  it  reaches  the 
central  cell.  For  example,  when  the  neighborhood 
is  7x7  cells,  data  from  the  outer  ring  of  cells 
must  pass  through  at  least  two  other  cells  before 
reaching the center cell. Because movement of data
takes  time,  this  "passing  through"  is  rather 
inefficient.  The  solution  is  to  have  the  data 
stored  in  the  intermediate  cells  on  its  way  to  the 
center,  thus  taking  advantage  of  the  fact  that 
those  cells  will  also  need  to  know  the  values  in 
order  to  compute  the  function  of  their 
neighborhoods. Although this will work, the
algorithm becomes rather messy since we must now
consider the actions of several cells at once and
how these relate to each other. It also becomes a
complex problem to determine an optimal set of data
collection paths as the neighborhood's diameter
varies.

It  turns  out  that  the  other  approach  to  examining 
the  neighborhood  greatly  simplifies  these  problems. 
This  approach  takes  the  opposite  view  of  the 
collection  process.  Instead  of  each  cell 
collecting all of the data from its neighborhood,
each  cell  distributes  its  own  data  to  every  cell  in 
the  neighborhood.  Because  every  other  cell  is  also 
doing  this,  the  end  result  is  that  the  central  cell 
(and  hence  all  cells)  gets  the  data  it  needs  from 
all  of  the  cells  in  the  neighborhood.  The  problem 
of  establishing  an  optimal  distribution  path  is 
trivial  for  a  square  array  of  odd  diameter:  It  is 
simply  a  rectangular  spiral  out  from  the  center 
cell.  For  even  diameter  square  neighborhoods  the 
problem  is  only  slightly  more  difficult  because  the 
center  cell  is  actually  half  of  a  cell  width  off 
center  in  two  directions.  In  this  case  it  is 
simply  required  that  the  appropriate  choice  of 
initial direction and of clockwise or
counterclockwise spiral be made to select the optimal
path.  The  only  other  point  that  requires 
mentioning  is  that,  because  this  is  a  distribution 
process  rather  than  a  collection  process,  the 
function  mask  for  the  convolution  must  be  mirrored 
across  the  central  cell.  For  example,  when  the 
cell's  value  is  being  stored  in  its  north  neighbor, 
the  function  applied  to  that  value  is  the  south 
neighbor  function.  The  reason  for  this  can  be  seen 
when  it  is  realized  that  the  central  cell  is 
actually  the  south  neighbor  of  the  cell  to  its 
north.  The  mirroring  of  the  convolution  function 
mask  is  actually  quite  easy  to  accomplish:  we 
simply  step  through  the  mask  in  exactly  the 
opposite  direction  that  the  distribution  path 
takes. 
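
As an illustration of the odd-diameter case, the sketch
below generates a rectangular spiral of moves out from the
center cell. The function names are hypothetical, and the
choice of starting north and turning clockwise is an
assumption; any consistent choice gives an equivalent path.

```python
# Displacement of each compass move as (x, y), with north as +y.
STEP = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}


def spiral_moves(diameter):
    """Moves carrying a value from the center of an odd-diameter square
    neighborhood through every other position exactly once."""
    if diameter % 2 == 0:
        raise ValueError("this sketch handles odd diameters only")
    directions = "NESW"          # clockwise turn order, starting north
    moves = []
    remaining = diameter * diameter - 1
    run, turn = 1, 0
    while remaining > 0:
        for _ in range(2):       # run lengths go 1, 1, 2, 2, 3, 3, ...
            step = min(run, remaining)
            moves.extend(directions[turn % 4] * step)
            remaining -= step
            turn += 1
        run += 1
    return moves


def visited_offsets(moves):
    """Offsets from the center reached along the path, center included."""
    x = y = 0
    offsets = [(0, 0)]
    for move in moves:
        dx, dy = STEP[move]
        x, y = x + dx, y + dy
        offsets.append((x, y))
    return offsets
```

For a 3x3 neighborhood this yields the eight moves north,
east, south, south, west, west, north, north. The number
of moves is always one less than the number of cells in
the neighborhood, so no cell is visited twice.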


Let's look at an example: a simple convolution for
smoothing isolated cells of noise out of an image.
We will use a 3x3 convolution mask in which the
cell accumulates the sum of its neighbors' values,
weighted inversely with distance away from the
center. The sum will then be normalized. Define
the mask to be a 3x3 array M with elements M_ij
(i, j = 0, 1, 2), where M_11 is the central cell.
Then the following algorithm will perform the
convolution (north is up, west is to the left, etc.):


i := 1
j := 1
sum := value * M_ij

move value north
i := i+1
sum := sum + value * M_ij

move value east
j := j+1
sum := sum + value * M_ij

move value south
i := i-1
sum := sum + value * M_ij

move value south
i := i-1
sum := sum + value * M_ij

move value west
j := j-1
sum := sum + value * M_ij

move value west
j := j-1
sum := sum + value * M_ij

move value north
i := i+1
sum := sum + value * M_ij

move value north
i := i+1
sum := sum + value * M_ij

value := sum * normalizing factor
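
The distribution technique can be checked in software
against a direct neighborhood sum. The sketch below is a
sequential model of what every cell does at once. The
function names are hypothetical, the mask is indexed by
(row, column) offset from the center rather than by the
i, j of the listing, the image wraps around at its edges
(an assumption, since edge treatment is not covered here),
and the final normalization is left out.

```python
# (row, column) displacement of each move; north is up, so the row decreases.
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
SPIRAL_3X3 = "NESSWWNN"   # spiral out from the center of a 3x3 mask


def shift(grid, d_row, d_col):
    """Every cell passes its value to the neighbor at (d_row, d_col)."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r - d_row) % h][(c - d_col) % w] for c in range(w)]
            for r in range(h)]


def convolve_by_distribution(image, mask, path=SPIRAL_3X3):
    """Each cell sends its value along the path; receivers accumulate it.

    A value that has travelled by offset `off` came from the cell at -off
    as seen by the receiver, so the mirrored mask entry is applied.
    """
    h, w = len(image), len(image[0])
    travelling = [row[:] for row in image]
    total = [[mask[(0, 0)] * v for v in row] for row in image]
    off_row = off_col = 0
    for move in path:
        d_row, d_col = MOVES[move]
        travelling = shift(travelling, d_row, d_col)
        off_row, off_col = off_row + d_row, off_col + d_col
        weight = mask[(-off_row, -off_col)]   # mirrored across the center
        for r in range(h):
            for c in range(w):
                total[r][c] += weight * travelling[r][c]
    return total


def convolve_directly(image, mask):
    """Reference: each cell collects from its neighborhood."""
    h, w = len(image), len(image[0])
    return [[sum(wt * image[(r + dr) % h][(c + dc) % w]
                 for (dr, dc), wt in mask.items())
             for c in range(w)] for r in range(h)]
```

With a deliberately asymmetric mask the two functions give
identical results, which confirms that the mirroring is
needed and is applied in the right sense.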


It  should  be  noted  that  the  time  required  to 
perform  a  convolution  using  the  parallel  processor 
is  independent  of  the  size  of  the  image  and  only 
dependent  upon  the  area  of  the  convolution  mask. 
Since the CAAPP does cell level arithmetic
bit-serially, the size of the data values also
affects the speed of the algorithm. It has been
determined that convolution with a 3x3 mask
requires 980 CAAPP operations. This corresponds to
98 microseconds per convolution, or 340
convolutions per frame time, or 10204 convolutions
per second.
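
These rates follow from the 100 ns instruction time, as
the short calculation below shows. The function name is
hypothetical, and a frame rate of thirty frames per second
is assumed for the frame time.

```python
def convolution_rates(operations, cycle_ns=100, frames_per_second=30):
    """Return (microseconds per convolution, convolutions per frame time,
    convolutions per second) for a given CAAPP operation count."""
    time_us = operations * cycle_ns / 1000.0
    per_second = 1.0e6 / time_us
    per_frame = per_second / frames_per_second
    return time_us, per_frame, per_second


time_us, per_frame, per_second = convolution_rates(980)
```

For 980 operations this gives 98 microseconds, a little
over 340 convolutions per frame time, and a little over
10204 convolutions per second.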

Convolutions  with  more  general  and/or  larger  masks 
will  take  longer.  A  very  rough  worst  case  estimate 
of  the  time  required  for  such  convolutions  can  be 
obtained  from  the  formula: 

$$T = P(0.8N + 0.2M + 0.1) + 0.3M(N^{2}P + N + 1)$$

where

T = time in microseconds
N = number of bits in a pixel
M = number of bits in a mask value
P = number of pixels in the mask area

This is a worst case time which assumes that all of
the bits in all of the mask values are ones (since
this gives the slowest multiply speed). Under
normal circumstances, T will be about half of the
value obtained from the formula. This also assumes
a totally general square mask where the values can
change. If constants are to be used for the mask
values, a significant speed increase can be
obtained by optimizing the multiplies for those
values. Thus, for example, a convolution on 16 bit
values with 8 bit mask values could be applied over
at most a 7x7 mask in one video frame time with a
worst case situation. For normal situations, it
should be possible to convolve a 10x10 area. Given
constant mask values, and depending upon the amount
of optimization possible, even a 25x25 mask could
be done in one video frame time.

As a final note, this method is not restricted to
square masks and in fact should be readily
generalizable to any mask shape. All that is
required for this is an algorithm for efficiently
shifting the center cell's value so that it covers
the mask area. Thus it should be possible to
easily adapt it to such mask shapes as annuli and
disjoint areas.


3.0 OPTIC FLOW FIELD DECOMPOSITION ON THE CAAPP


The realization of motion perception in artificial
systems will require highly parallel architectures.
We have explored the use of the CAAPP as an
effective means of quickly and robustly decomposing
a flow field into its rotational and translational
components [4,5] to recover the parameters of
sensor motion.

The algorithm is an exhaustive search procedure
which uses a set of rotational and translational
flow field templates to find a component pair which
can account for the motion depicted in a given flow
field. Currently, 1000 rotational templates and
200 translational templates are used. These are
generated from 100 axes which are uniformly
distributed with respect to a unit hemisphere, and
all pass through the origin of the sensor
coordinate system. Each flow field consists of
16x16 vectors and is stored on a 2x2 square of
chips consisting of 256 cells. The 2x2 chip
arrangement facilitates flow field addressing.
Each cell contains the horizontal and vertical
components of a flow vector, each specified with 10
bits of precision.


The  algorithm  consists  of  four  basic  steps. 

(0)  The  rotational  templates  are  loaded  into  the 
CAAPP,  one  template  for  each  flow  field  location. 
Each  flow  field  location  corresponds  to  one  of  the 
squares  in  the  CAAPP  diagrams  shown  in  Figures  5a, 
5b,  and  5c.  The  rotational  templates  need  only  be 
loaded once since they are used in determining any
flow field decomposition.

(1)  A  copy  of  the  input  flow  field  is  loaded  into 
each  flow  field  location  in  the  CAAPP.  Figure  4a 
and  4b  show  two  sample  input  fields,  both  produced 
by  the  same  motion  and  environment,  except  that 
Figure 4b was produced by adding random spike noise
to Figure 4a.

(2) A set of difference fields is formed by
subtracting each rotational template from the copy
of the input flow field stored with it. For each
resulting difference field, the slope of each
difference vector is computed by dividing the
vertical component by the horizontal component.
These subtraction and division procedures are
performed in parallel across all flow fields
represented in the CAAPP.

(3)  The  similarity  between  the  difference  fields 
and  each  of  the  translational  templates  is 
evaluated,  proceeding  sequentially  through  all  the 
translational  templates.  For  a  given  translational 
template,  this  comparison  is  done  in  parallel  with 
all  difference  fields  stored  in  the  CAAPP  and 
consists  of  the  following  steps: 

(3a) The slope of each component vector of the
translational template is loaded into the
corresponding vector location of each difference
field. The sign of the slope of each difference
vector is compared with the sign of the slope of
the corresponding translational template vector.
If the signs agree, the absolute value of the
difference between the slope of the difference
vector and the slope of the translational template
vector is computed, and then scaled according to
the absolute value of both slopes. If the scaled
slope difference does not exceed a predetermined
maximum error value, then a vector match is
designated at that position. The quantity of error
permitted here allows the algorithm to be resistant
to uniformly distributed Gaussian noise of low
variance present in the original flow field.


(3b)  For  each  difference  field  the  number  of  vector 
slope  matches  is  counted.  If  this  sum  exceeds  a 
predetermined  minimum  number  of  matches  (in  our 
implementation,  75%  of  the  field  size),  then  the 
associated  rotational  and  translational  templates 
become  a  candidate  pair  for  the  flow  field 
decomposition.  Utilization  of  a  minimum  number  of 
required  matches  ensures  that  only  templates  which 
are  reasonably  close  to  the  actual  motion  will  be 
chosen  and  permits  some  resistance  to  random  spike 
noise.  Figure  5a  shows,  for  difference  fields 
resulting  from  the  input  field  in  Figure  4a,  the 
CAAPP  response  to  the  translational  template  which 
is  closest  to  the  actual  translational  motion. 
Each black dot within a square represents a
position  in  a  difference  field  at  which  the  slope 
of  the  difference  vector  matches  the  slope  of  the 
translational  template.  Figure  5b  shows,  for 
difference  fields  resulting  from  the  input  field  in 
Figure  4b,  the  CAAPP  response  to  the  translational 
template  which  is  closest  to  the  actual 
translational  motion.  Figure  5c  shows  the  CAAPP 
response  to  a  translational  template  which  is  not 
close  to  the  actual  translational  motion.  This 
incorrect  translational  template  is  shown  in  Figure 
6. 


(3c) For all difference fields yielding at least
the required minimum number of matches, the
variance of the scaled slope difference is
computed, and the difference field with the minimum
variance is determined. This value is compared to
the minimum variance found from processing the
preceding translational templates. If this value
is less than the preceding minimum, it becomes the
new global minimum, and the rotational template
associated with the difference field together with
the current translational template become the
current best candidate pair for the flow field
decomposition.

Steps  3a,  3b,  and  3c  are  performed  for  each 
translational  template. 

(4) The flow field decomposition considered to be
the best is the rotational and translational
template pair resulting in the difference field
yielding at least the required minimum number of
matches and the least slope difference variance.
Utilizing minimum variance instead of the maximum
number of matches, the algorithm has achieved
better results, particularly for motions whose
component parts lie between sets of templates.
Figures 7a and 7b show the rotational and
translational templates selected by the algorithm
in the presence of and in the absence of noise, for
the input fields in Figures 4a and 4b. These
templates are the closest ones to the actual
motions. Figures 8a and 8b show the difference
fields resulting from subtracting the rotational
motion in 7a from the original fields in Figures 4a
and 4b respectively.
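
The search in steps (2) through (4) can be sketched
sequentially as follows. This is a minimal illustration
under stated assumptions, not the CAAPP program: all names
are hypothetical, fields are flat lists of (horizontal,
vertical) pairs, the scaling of the slope difference is a
guess at "scaled according to the absolute value of both
slopes", the maximum error of 0.05 is an arbitrary choice,
and vectors with a zero horizontal component are simply
skipped.

```python
import math
import statistics


def slope(vector):
    """Vertical over horizontal component; infinite for a vertical vector."""
    u, v = vector
    if u == 0:
        return math.copysign(math.inf, v)
    return v / u


def scaled_slope_difference(s_diff, s_temp):
    """Assumed scaling: absolute difference relative to both slope sizes."""
    return abs(s_diff - s_temp) / (1.0 + abs(s_diff) + abs(s_temp))


def decompose(flow, rotational, translational,
              max_error=0.05, min_fraction=0.75):
    """Return (variance, rotational index, translational index) or None."""
    # Step 2: one difference field per rotational template, kept as slopes.
    diff_slopes = [[slope((fu - ru, fv - rv))
                    for (fu, fv), (ru, rv) in zip(flow, template)]
                   for template in rotational]
    best = None
    # Step 3: translational templates are tried one after another.
    for t_index, template in enumerate(translational):
        t_slopes = [slope(vector) for vector in template]
        for r_index, d_slopes in enumerate(diff_slopes):
            errors = []
            for s_d, s_t in zip(d_slopes, t_slopes):
                if math.isinf(s_d) or math.isinf(s_t):
                    continue                      # assumption: skip these
                if (s_d >= 0) != (s_t >= 0):
                    continue                      # 3a: signs must agree
                error = scaled_slope_difference(s_d, s_t)
                if error <= max_error:
                    errors.append(error)          # 3a: a vector match
            # 3b: require a minimum number of matches.
            if not errors or len(errors) < min_fraction * len(flow):
                continue
            # 3c: keep the candidate with the least variance so far.
            variance = statistics.pvariance(errors)
            if best is None or variance < best[0]:
                best = (variance, r_index, t_index)
    return best  # step 4: the best pair over all templates
```

Because only slopes are compared, a translational field
scaled by an unknown depth at each point still matches its
template, which is why the test divides the vertical
component by the horizontal one rather than comparing
vectors directly.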

3.1 Flow Field Decomposition Experiments

Experiments have been performed with a CAAPP
simulator on a VAX 11/780 using a wide variety of
motions and simulated environments. In all cases
examined, the translational template closest to the
actual translational motion was selected. The
rotational template was always close to the actual
rotational motion, but was sometimes not the
closest template. The procedure proved to be
resistant to limited Gaussian noise as well as to
limited random spike noise in the original flow
field. Applying motion to points at random depths
produced results similar to those obtained in the
noise experiments. The algorithm's performance
degraded slightly if each flow vector component was
specified by eight bits of precision instead of by
ten.

The CAAPP timing calculations revealed that the
algorithm could perform the
rotational-translational decomposition in slightly
more than 1/4 second. If two CAAPPs are used in
parallel, then the time can be reduced to less than
1/5 second, since only half of the translational
templates need be tested on each CAAPP. Given
fabrication techniques available in the immediate
future, we expect execution times to be
significantly improved. We suspect that
performance will improve and be applicable to more
realistic image sequences by increasing both the
number and size of the rotational and translational
templates. This amounts to utilizing more CAAPPs
in parallel.


4.0 CURRENT RESEARCH


Based on our test chip experience,
we intend to proceed to full sixty-four cell ICs
and, eventually, construction of the entire
machine. Architectural changes which we intend to
pursue are increasing the memory size to at least
sixty-four bits per cell and perhaps going to an
8:2 communications multiplex (with a twenty-eight
pin package) for a doubling in the data transfer
rate.


Our work thus far has indicated that a Content
Addressable Array Parallel Processor is well suited
for many aspects of image processing, vision, and
motion analysis. We are exploring the effective
implementation of a wide range of image processing
algorithms on the CAAPP [6]. We intend to pursue
further applications in these areas and also in new
areas such as tactile object recognition in
robotics.


5.0 CONCLUSIONS


The  key  feature  of  this  design  is  its  integration 
of  associativity  with  array  processing.  The  result 
does  well  what  each  of  these  architectures  normally 
can  do  individually  but  additionally  may  be  applied 
in  a  number  of  ways  that  can  only  be  approached  by 
the  integrated  combination.  Thus  it  becomes 
possible to perform both low level (such as image
convolutions) and high level (such as real-time
LISP [7]) processing in the same machine.

The  importance  of  using  a  conservative  approach  to 
development cannot be overemphasized. Too often
designs  for  computers  have  been  fielded  which  push 
technology  too  far.  The  result  has  usually  been  a 
single  machine  which  is  nearly  impossible  to  keep 
running  and  too  costly  to  be  replicated.  By 
keeping  our  design  conservative,  we  hope  to  produce 
a  machine  that  is  both  useable  and  replicable  at  a 
reasonable  cost  (on  the  same  order  as  a 
mini-mainframe).  This  would  then  make  it  possible 
for  other  research  facilities  to  have  similar 
machines  without  the  extra  cost  of  development. 
The study of parallelism and its applications can
then be advanced by providing the research
community with a useable standardized parallel
processor.


A  design  has  been  presented  for  a  Content 
Addressable  Array  Parallel  Processor  suitable  for 
both  general  use  and  image  processing  applications. 
The  architecture  of  the  processor  is  based  in 
practical  experience  and  the  hardware  design  has 
been  constrained  to  make  it  possible  to  construct 
using  existing  technology  and  with  a  high 
confidence  of  success.  Despite  these  constraints, 
simulations  have  shown  that  such  a  machine  would 
provide  a  significant  increase  in  processing  power 
over  what  is  presently  available. 


A  method  has  been  shown  which  can  be  used  to 
program  the  Content  Addressable  Array  Parallel 
Processor  to  perform  image  convolutions  simply  and 
efficiently.  Such  a  program,  for  a  simple 
convolution,  was  shown  which  operates  in 
ninety-eight  microseconds.  The  time  of  the 
algorithm  is  independent  of  the  size  of  the  image 
and  depends  only  upon  the  size  of  the  mask  and,  for 
bit  serial  processing,  upon  the  number  of  bits  in 
the  pixel  and  mask  values.  A  formula  was  given  for 
a worst case time estimate and a factor for
estimating normal case time from this was
discussed. It was also noted that the method could
be  applied  to  masks  of  other  than  square  shapes. 

We  have  also  shown  how  the  CAAPP  may  be  used  in  the 
analysis  of  motion  from  image  sequences.  For 
certain applications, the parallelism provided by
the CAAPP would make it possible to perform such
analysis robustly and nearly at video frame rate.